Sunday, October 14, 2012

More multiple English data

In several of the earlier entries, we described the various English paraphrase data that we've collected. At long last, that data has been gathered into tab-separated files with the following fields:

  1. Corpus name;
  2. Document ID;
  3. Segment ID;
  4. Source language;
  5. Genre;
  6. Whether each segment is a headline or sentence;
  7. Number of English translations;
  8. Translations.
So this is some cool data. This post is going to talk about what we've done with it so far, and other ideas for doing things with it.
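For anyone who wants to poke at the files, here is a minimal sketch of a reader for the eight fields listed above. It assumes one segment per line, no header row, and that the translations occupy however many tab-separated columns follow the count field; the `Segment` class and `read_segments` function are hypothetical names for illustration, not part of any released tool.

```python
import csv
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class Segment:
    corpus: str
    doc_id: str
    seg_id: str
    source_lang: str
    genre: str
    seg_type: str            # headline-or-sentence marker, kept as stored in the file
    translations: List[str]


def read_segments(lines: Iterable[str]) -> Iterator[Segment]:
    """Yield one Segment per non-empty tab-separated line.

    Assumed layout (not confirmed by the post): seven fixed fields,
    then one column per English translation.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue
        corpus, doc_id, seg_id, source_lang, genre, seg_type, count = row[:7]
        translations = row[7:]
        # Sanity check: the declared count should match the columns present.
        if len(translations) != int(count):
            raise ValueError(
                f"{corpus}/{doc_id}/{seg_id}: expected {count} translations, "
                f"found {len(translations)}"
            )
        yield Segment(corpus, doc_id, seg_id, source_lang, genre,
                      seg_type, translations)
```

Since `read_segments` takes any iterable of lines, an open file handle can be passed directly.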

Saturday, September 29, 2012

Some parsing statistics

So we finally got the parser to parse the data using the same grammar that it was generated with. We used a Hiero grammar to translate the NIST09 Urdu test set into English. Then we used the same grammar that was used as the translation model to try to parse the Urdu sentences paired with their 1-best MT outputs. We were successful.

I derived the following graph from the log file of the parsing run:


Another quick thing to report is that we've started at least one fold of the 10-fold cross-validation experiment (more detail on that later). As a quick summary, we split the Urdu--English training data into 10 folds, then trained a Hiero grammar on 9 of the folds and parsed the held-out data. Separately, we're also parsing the training data that we derived the grammar from.
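The fold setup can be sketched roughly as follows. This is only an illustration: it assigns sentence pairs to folds round-robin, which is an assumption (the actual split may have been contiguous or randomized), and `split_folds` and `train_and_heldout` are hypothetical helper names.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_folds(pairs: Sequence[T], k: int = 10) -> List[List[T]]:
    """Assign items to k folds round-robin (assumed scheme, see lead-in)."""
    return [list(pairs[i::k]) for i in range(k)]


def train_and_heldout(folds: Sequence[Sequence[T]],
                      held_out: int) -> Tuple[List[T], List[T]]:
    """Return (training data from the other folds, held-out fold)."""
    train = [item
             for i, fold in enumerate(folds) if i != held_out
             for item in fold]
    return train, list(folds[held_out])
```

With `held_out=0`, the grammar would be extracted from the first return value and the parser run on the second; parsing the training data itself just means running the parser on the first return value instead.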

I plan to write another post soon describing the error analysis we hope to perform on this parsing data. In the meantime, here are some raw numbers:

Training on folds 1-9, parsing fold 0: 5928 sentences, 1095 parsed successfully.
Training on folds 1-9, parsing folds 1-9: processed 10919 sentences so far, 4764 parsed successfully.
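To make those counts easier to compare, here is a small sketch that turns them into success rates. The numbers are the ones reported above; the second run was still in progress, so its rate is only a snapshot.

```python
def parse_rate(parsed: int, total: int) -> float:
    """Percentage of sentences that parsed successfully."""
    return 100.0 * parsed / total


# (parsed, total) pairs copied from the raw numbers above.
runs = {
    "held-out fold 0": (1095, 5928),
    "training folds 1-9 (partial)": (4764, 10919),
}

for name, (parsed, total) in runs.items():
    print(f"{name}: {parse_rate(parsed, total):.1f}% of {total} sentences parsed")
# held-out fold 0: 18.5% of 5928 sentences parsed
# training folds 1-9 (partial): 43.6% of 10919 sentences parsed
```

So roughly 18% of held-out sentences parse, against roughly 44% of the training sentences processed so far.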

Tuesday, September 25, 2012

Multiple English translations: amount of data

Below the fold, a table of the amount of data in the datasets listed in my previous entry. I kind of wish Blogger would let me put in a real table.

Thursday, September 20, 2012

Debugging a Parser (Part 3)

I spoke too soon about the model. With Chris's help, we noticed that the translation grammar still contained many abstract rules of the form

[X] ||| [X,1] [X,2] ||| { ... some target side ... }

This causes a problem, because without pruning in the first-pass parse, we're looking for all possible parse trees licensed by the source side of the grammar for the source sentence. A dimly-remembered result from Jason's NLP class tells us that, when we have a "promiscuous" CFG with the production rules

S -> X
X -> X X
X -> a

where a is any terminal symbol, a sentence of length N has a number of possible parse trees equal to the (N-1)th Catalan number (counting from C_0 = 1), which grows exponentially in N. That's a lot of parses.
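To get a feel for the blow-up, here is a small sketch that counts the trees for the toy grammar directly and checks the count against the closed-form Catalan number. With the usual indexing C_0 = 1, a sentence of N words has C_{N-1} binary bracketings. The function names are mine, purely for illustration.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_trees(n: int) -> int:
    """Number of parse trees for n terminals under X -> X X | a."""
    if n == 1:
        return 1  # X -> a
    # X -> X X: pick a split point, multiply the counts of the two halves.
    return sum(count_trees(k) * count_trees(n - k) for k in range(1, n))


def catalan(n: int) -> int:
    """Closed form: C_n = (2n choose n) / (n + 1)."""
    return comb(2 * n, n) // (n + 1)


for n in (4, 10, 20):
    assert count_trees(n) == catalan(n - 1)
    print(n, count_trees(n))
# 4 5
# 10 4862
# 20 1767263190
```

A 20-word sentence already has well over a billion bracketings from the abstract rule alone, before any lexicalized rules are considered.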

Tuesday, September 18, 2012

Debugging a Parser (Part 2)

It's not the model! Recall that at the end of the first part of this series, we decided that either there was still a cycle in the model allowing building an unbounded chart, or else there was something wrong with the pruning. Well, I've managed to get a run going where it doesn't take 12 hours to parse a four-word sentence, and the culprit was quite a surprise.

Friday, September 14, 2012

Debugging a Parser (Part 1)

I've been trying to implement a synchronous parser for a long time. I thought that I had it working during the spring, but three months later I can't reproduce the results. The simplest way to check whether the synchronous parser in Joshua is working seems to be this:
  1. With a given SCFG, translate a set of sentences.
  2. Try to parse the resulting source--target sentence pairs.
Presumably, since the target sentences were generated by the SCFG, the parser should be able to recognize all the sentence pairs that are produced.
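The two-step check above amounts to a round-trip test. A minimal sketch of the harness follows; `translate` and `can_parse` are hypothetical callables standing in for the decoder and the synchronous parser, and nothing here reflects Joshua's actual interface.

```python
from typing import Callable, Iterable, List, Tuple


def round_trip_failures(
    sources: Iterable[str],
    translate: Callable[[str], str],
    can_parse: Callable[[str, str], bool],
) -> List[Tuple[str, str]]:
    """Translate each source sentence, then try to parse the resulting pair.

    Returns the (source, target) pairs the parser rejected. If the same
    grammar is used for both steps, this list should be empty.
    """
    failures = []
    for src in sources:
        tgt = translate(src)
        if not can_parse(src, tgt):
            failures.append((src, tgt))
    return failures
```

Any pair that shows up in the result points at a bug in the parser, or at a mismatch between the grammar the decoder used and the one the parser loaded.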

Thursday, September 13, 2012

Data with multiple English translations

I'm hunting down all the datasets I can find with multiple English reference translations of some foreign data. We're making the assumption that multiple translations of the same data are going to be paraphrases of each other.