- Corpus name;
- Document ID;
- Segment ID;
- Source language;
- Genre;
- Whether each segment is a headline or sentence;
- Number of English translations;
- Translations.
If we look at just the LDC data, there are about 52,000 individual segments. However, if we take all the possible sentence pairs (most LDC corpora have four parallel English translations), the number of training examples is closer to 250,000. This is a decent-sized corpus, though somewhat on the small side. It's comparable in size to the NIST Urdu--English 2009 data, but much smaller than parallel corpora for well-studied language pairs.
So if we use this data to train, say, an SCFG, what can we do with it?
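To make the pair count concrete, here is a minimal sketch of how one segment's translations expand into sentence pairs. It assumes pairs are unordered (each pair counted once), which the numbers above don't confirm; the function name and the placeholder translations are made up for illustration.

```python
from itertools import combinations

def paraphrase_pairs(translations):
    """Return every unordered pair of translations of one segment."""
    return list(combinations(translations, 2))

# Hypothetical segment with four parallel English translations.
segment = ["translation 1", "translation 2", "translation 3", "translation 4"]
pairs = paraphrase_pairs(segment)
print(len(pairs))  # 6 unordered pairs; 12 if each direction is counted separately
```

Under that assumption, 52,000 segments with exactly four translations each would give at most 52,000 x 6 = 312,000 pairs, so a total near 250,000 is what you'd expect if some segments have fewer than four translations.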
One thing we've been studying so far is how many held-out sentence pairs can be re-parsed by the grammar we train. But this number doesn't tell us very much; for example, on one held-out fold, we were able to parse 173/1067, or about 16% of the sentence pairs. It would be easy to just conclude "the model doesn't explain new sentence pairs very well." But we want to look closer and figure out why. Let's look at some un-parseable pairs.
perhaps i was really helpless ||| perhaps , i was compelled to do so
In this one, we're missing any rule to handle the word "helpless" in the input sentence. "Compelled" is almost always matched up with "required" in the training data, so there's no real way to handle that word.
on the contrary , they 're flourishing at the expense of pakistan . ||| they are still plundering the same pakistan .
In this sentence pair, we can't handle the phrase "at the expense of". Every rule in the grammar that sees that phrase leaves it in. Note more generally that this formalism doesn't allow epsilon rules: we can't arbitrarily delete or insert phrases. That could be a problem with explaining paraphrases (or doing tasks like compression).
I think this lack of epsilon rules might be a bigger problem than we think: if rules cannot in general remove information, that means that to explain a sentence pair, both sentences need to have almost identical informational content. We can't drop information.
rescue teams had lost hope of finding anyone alive . ||| the rescue teams had lost hope of finding people alive .
This is an interesting pair that indicates the problem. The two sentences are identical except for the first "the" in the second sentence. But under our model there's no way to insert this (semantically useless) word.
As you can see, the grammars are not really horrible. If you inspect rules, you can see that they do in general encode paraphrase pairs. The problem is that the new data we want to explain is not always paraphrased exactly how we would like. The ideal input for a model like this would be two sentences, where we can break the sentences into phrases, and there's a one-to-one correspondence between the phrases. But that happens pretty rarely.
The next thing to do with some of this data is to classify errors. We've already thought of some problems:
- The two sentences don't carry the same information. If we think about it from a MacCartney-esque natural logic point of view, what we mean is that there's some phrase in one sentence that doesn't stand in any specific logical relation to any phrase in the other sentence.
- Syntactic divergence -- structural differences that our model can't handle. The inserted "the" is an example. We should note here that the grammars we've been using are Hiero models, but with a richer syntactic model, this will be a richer source of errors -- there might be paraphrase rules that aren't in the grammar if they don't fit into our syntactic model (cf. Koehn (2003) about restricting phrases to syntactic constituents).
- Bad luck: one pair failed because a "someone" -> "some one" rule wasn't available, which is too bad. This indicates possible submodules to the translation process: tokenization, etc.
One interesting way of using the 250,000 English-English sentence pairs that you have assembled would be to sort them by some criterion. For instance, Juri and Courtney were working on sentence compression, so you could find interesting data for them by sorting the sentence pairs based on their length ratio.
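A rough sketch of that sorting idea, assuming the pairs are stored one per line in the `sentence ||| sentence` format used in the post and that both sides are non-empty. The function names are hypothetical, and length is measured in whitespace-separated tokens.

```python
def parse_pair(line):
    """Split 'sentence ||| sentence' into two token lists."""
    left, right = line.split("|||")
    return left.split(), right.split()

def length_ratio(line):
    """Shorter length divided by longer length, in tokens (between 0 and 1)."""
    a, b = parse_pair(line)
    return min(len(a), len(b)) / max(len(a), len(b))

pairs = [
    "perhaps i was really helpless ||| perhaps , i was compelled to do so",
    "rescue teams had lost hope of finding anyone alive . ||| "
    "the rescue teams had lost hope of finding people alive .",
]

# Most lopsided pairs first: these are the likeliest compression candidates.
for line in sorted(pairs, key=length_ratio):
    print(round(length_ratio(line), 2), line)
```

Sorting ascending puts the pairs with the biggest length mismatch at the top, which is where one sentence is most likely to be a compressed version of the other.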
This data might be useful for our upcoming DEFT grant because it could be used for RTE. You might also try finding interesting examples for RTE by looking for sentence pairs with the least amount of word overlap. Bag of words models worked well in some of the original RTE tasks, and this might be a way of finding data that would require more sophisticated models. My assumption is that all data here are semantically equivalent, and that the entailment goes in both directions. However, from your examples, this might not be the case.
i was helpless ||| i was compelled to do something
I can possibly see that
i was compelled to do something --> i was helpless (because it was not my choice)
But definitely not
i was helpless --> i was compelled to do something (because I may have been unable to do anything)
Is there a methodology for making such judgments? What were the guidelines for classifying the original RTE sentences? It might be described here: http://u.cs.biu.ac.il/~dagan/publications/RTEChallenge.pdf
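For the word-overlap suggestion above, one simple way to score a pair is Jaccard overlap on the two bags of words. This is only a sketch: the choice of Jaccard, the function name, and whitespace tokenization are assumptions, not anything prescribed by the RTE guidelines.

```python
def jaccard_overlap(line):
    """Shared word types divided by all word types in a '|||'-separated pair."""
    left, right = line.split("|||")
    a, b = set(left.split()), set(right.split())
    return len(a & b) / len(a | b)

pair = "i was helpless ||| i was compelled to do something"
print(round(jaccard_overlap(pair), 3))  # 2 shared types out of 7 -> 0.286
```

Sorting the corpus by this score in ascending order would surface the pairs that a bag-of-words model is least likely to get right.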
For your rescue teams example, shouldn't we have an SCFG rule that looks like this? NP -> NN NNS | the NN NNS
Yes, we easily could have such a rule.
But at the moment we may be filtering it out, since the source side contains no terminal symbols. Is that right? Does the filtering happen during grammar extraction or before synchronous parsing?
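To illustrate the kind of filter in question, here is a hypothetical sketch, not the actual extraction code. It assumes rules are written as in the comment above (`LHS -> source | target`) and that nonterminals are exactly the all-uppercase symbols; a real Hiero-style grammar would mark nonterminals differently.

```python
def has_source_terminal(rule):
    """True if the source side of 'LHS -> source | target' has a terminal."""
    _lhs, rhs = rule.split("->")
    source, _target = rhs.split("|")
    # Assumption: all-uppercase symbols are nonterminals, anything else is a word.
    return any(not tok.isupper() for tok in source.split())

rules = [
    "NP -> NN NNS | the NN NNS",  # source side is purely nonterminal
    "NP -> the NN NNS | NN NNS",  # reverse direction: source has "the"
]
kept = [r for r in rules if has_source_terminal(r)]
print(kept)
```

Under this filter the insertion rule is dropped while its mirror image (deleting "the") survives, which would explain why the grammar cannot insert the word.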