Saturday, September 29, 2012

Some parsing statistics

We finally got the parser to parse data using the same grammar that generated it. We used a Hiero grammar to translate the NIST09 Urdu test set into English. Then we took that same grammar, the one that served as the translation model, and used it to parse the Urdu sentences paired with their 1-best MT outputs. We were successful.

I derived the following graph from the log file of the parsing run:


Another quick thing to report is that we've started at least 1 fold of the 10-fold cross-validation experiment (more detail on that later). As a quick summary, we split the Urdu-English training data into 10 folds, then trained a Hiero grammar on 9 of the folds and parsed the held-out fold. Separately, we're also parsing the training data that we derived the grammar from.
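For readers who want the fold setup spelled out, here is a minimal sketch of the split. The round-robin assignment and the function names `split_into_folds` and `train_and_heldout` are assumptions for illustration only; this post does not describe how the folds were actually formed.

```python
def split_into_folds(sentence_pairs, num_folds=10):
    """Distribute sentence pairs across num_folds folds.

    Round-robin assignment is an assumption made for this sketch.
    """
    folds = [[] for _ in range(num_folds)]
    for i, pair in enumerate(sentence_pairs):
        folds[i % num_folds].append(pair)
    return folds


def train_and_heldout(folds, heldout_index):
    """Return (training pairs, held-out pairs) for one fold of the experiment."""
    heldout = folds[heldout_index]
    train = [
        pair
        for i, fold in enumerate(folds)
        if i != heldout_index
        for pair in fold
    ]
    return train, heldout


if __name__ == "__main__":
    # Toy data standing in for Urdu-English sentence pairs.
    pairs = [(f"ur_{i}", f"en_{i}") for i in range(25)]
    folds = split_into_folds(pairs)
    train, heldout = train_and_heldout(folds, heldout_index=0)
    print(len(train), "training pairs,", len(heldout), "held-out pairs")
```

The grammar would be extracted from `train` and then used to parse `heldout`. Parsing `train` itself corresponds to the second run mentioned above.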

I plan to write another post soon describing the error analysis we hope to perform on this parsing data. In the meantime, here are some raw numbers:

Training on folds 1-9, parsing fold 0: 5928 sentences, 1095 parsed successfully.
Training on folds 1-9, parsing folds 1-9: processed 10919 sentences so far, 4764 parsed successfully.
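To put the two counts on the same scale, here is a small sketch that turns them into success rates. The helper name `success_rate` and the run labels are made up for illustration; the counts are the ones listed above.

```python
def success_rate(parsed, total):
    """Fraction of sentences that parsed successfully."""
    if total <= 0:
        raise ValueError("total must be positive")
    return parsed / total


# (parsed, total) for each run, copied from the raw numbers above.
runs = {
    "train folds 1-9, parse fold 0 (held out)": (1095, 5928),
    "train folds 1-9, parse folds 1-9 (so far)": (4764, 10919),
}

if __name__ == "__main__":
    for label, (parsed, total) in runs.items():
        rate = success_rate(parsed, total)
        print(f"{label}: {parsed}/{total} = {rate:.1%}")
```

That works out to roughly 18.5% on the held-out fold and 43.6% on the training folds. The second run was still in progress, so its rate may change.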

1 comment:

  1. I'd like to see a comparison between the parsing times for a Hiero grammar and an SAMT grammar. Could you generate that comparison? You can use the same scatterplot, just color the dots differently for the two grammars.
