As I noted, I'm pretty sure I've removed all the rules that might cause an unbounded chart; since we're working with a Hiero grammar, the only such abstract non-branching rule is
[X] ||| [X,1] ||| [X,1]
and I removed that one. But when I re-ran the parser with that rule removed, parsing a short sentence was still taking forever. So, if it's not a cycle in the rules, it must be a problem with the pruning code, right? Well, not exactly. As I noted in an earlier post, I'm kind of skeptical that Joshua's pruning mechanism could be broken in any serious way, since we've been using the code in a lot of experiments for a very long time.
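To double-check that no cycle-inducing rules remain, one could scan the grammar for unary nonterminal rewrites and look for a cycle among them. The following is a minimal sketch, assuming each rule is a line of the form `[LHS] ||| source ||| target` as in the rule shown above; the function names are illustrative and are not part of Joshua.

```python
import re
from collections import defaultdict

# Matches a nonterminal such as [X] or [X,1] and captures the label (X).
NT = re.compile(r"^\[([^\],]+)(?:,\d+)?\]$")

def unary_edges(lines):
    """Yield (lhs, rhs) labels for rules whose source side is one nonterminal."""
    for line in lines:
        fields = [f.strip() for f in line.split("|||")]
        if len(fields) < 3:
            continue  # not a rule line in the assumed format
        lhs = NT.match(fields[0])
        source = fields[1].split()
        if lhs and len(source) == 1:
            rhs = NT.match(source[0])
            if rhs:
                yield lhs.group(1), rhs.group(1)

def has_unary_cycle(lines):
    """Return True if the unary rewrites form a cycle (e.g. X -> X)."""
    graph = defaultdict(set)
    for lhs, rhs in unary_edges(lines):
        graph[lhs].add(rhs)

    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = defaultdict(int)

    def visit(node):
        color[node] = GRAY
        for nxt in graph.get(node, ()):
            if color[nxt] == GRAY:
                return True  # back edge: the rewrite chain loops
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in list(graph))
```

A rule like `[X] ||| [X,1] ||| [X,1]` is reported as a cycle, while a lexicalized rule such as `[X] ||| the [X,1] ||| le [X,1]` is ignored because its source side contains a terminal.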
The breakthrough came when I remembered that, in fact, I had managed to parse a few sentences last week (in the course of trying the 10-fold cross validation) before we took a step back and decided to parse some translation data. What was the difference between the two runs? I decided to compare the Joshua configuration files that I was using. I had switched to the new configuration format that was introduced over the summer!
I restarted the run using an old-style configuration file that points to the new model I'm using. So far, we've at least managed to parse the first sentence of the corpus, which is a lot better than we were doing before. The second sentence is quite long, so I don't know how long it will take to process, but with luck, we'll have a re-parsed corpus overnight. That would put us well on our way to having a robust parser, with the first of four experiments completed.
I'm putting together some theories about the new-style configuration:
- Maybe I didn't correctly set the LM to "none"? That would slow things down enormously if an LM of order 5 were used with exhaustive parsing.
- Maybe the pruning didn't get set correctly? I think in the new style, all the old pruning parameters have been reduced to one parameter, the pop limit for cube pruning. As I recall, setting this parameter to 0 will turn on exhaustive parsing. To be safe, I've also set useCubePrune and useBeamAndThresholdPrune to false, which you used to have to do. If these weren't turned off by whatever the new config-parsing code does -- well, actually, that shouldn't hurt anything. We might not get 100% coverage with pruning on, but it wouldn't make parsing slower.
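One way to test these theories is to compare the settings in the two configuration files key by key. The sketch below assumes both files can be read as simple `key = value` lines with `#` comments, which may not hold for every option in the new format; `read_config` and `diff_configs` are illustrative helpers, not Joshua code, and `pop_limit` in the usage example is a placeholder key name.

```python
def read_config(text):
    """Parse 'key = value' lines into a dict, ignoring comments and blanks."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings

def diff_configs(old, new):
    """Return {key: (old_value, new_value)} for every setting that differs."""
    keys = sorted(set(old) | set(new))
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}

# Usage with made-up file contents (placeholder values, not real settings):
old_style = read_config("useCubePrune=false\nuseBeamAndThresholdPrune=false\n")
new_style = read_config("# new style\nuseCubePrune = false\npop_limit = 0\n")
for key, (before, after) in diff_configs(old_style, new_style).items():
    print(f"{key}: old={before!r} new={after!r}")
```

A key that appears on only one side shows up with `None` for the other, which makes it easy to spot a setting such as the LM or a pruning flag that silently fell back to a default.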
Are you sure that the version you're using now has all of the changes that you made in the spring? Remember that we debugged how you were marking a node as visited so that you didn't revisit it.
I am pretty sure that my parse branch got pulled into the central devel branch, so it should be okay, but I'll double-check.
The code seems correct.