Sunday, October 14, 2012

More multiple English data

In several earlier entries, we described the various English paraphrase data that we've collected. At long last, that data has been consolidated into tab-separated files with the following fields:

  1. Corpus name;
  2. Document ID;
  3. Segment ID;
  4. Source language;
  5. Genre;
  6. Whether each segment is a headline or sentence;
  7. Number of English translations;
  8. Translations.
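
To make the layout concrete, here is a minimal Python sketch of reading one of these files. It rests on a few assumptions that the list above doesn't spell out: that there is no header row, that field 7 is an integer count, and that the translations fill the tab-separated columns after it (one translation per column). The `Segment` type, the `read_segments` function, and the sample row are all hypothetical names and values made up for illustration.

```python
import csv
import io
from typing import Iterator, List, NamedTuple


class Segment(NamedTuple):
    corpus: str
    doc_id: str
    seg_id: str
    source_lang: str
    genre: str
    seg_type: str            # headline or sentence; the exact labels are an assumption
    translations: List[str]


def read_segments(handle) -> Iterator[Segment]:
    """Yield one Segment per tab-separated row read from an open text handle."""
    # QUOTE_NONE keeps any quotation marks in the translations as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 8:
            continue  # skip blank or truncated rows
        count = int(row[6])  # assumed: field 7 holds the number of translations
        # Assumed layout: the translations occupy columns 8 onward.
        translations = row[7:7 + count]
        yield Segment(*row[:6], translations)


# Usage with a made-up row; a real file would be opened with open(path, encoding="utf-8").
sample = "demo\tdoc1\t1\tar\tnews\theadline\t2\tHello world\tHi world\n"
for seg in read_segments(io.StringIO(sample)):
    print(seg.seg_id, seg.seg_type, seg.translations)
```

If the translations turn out to be packed into a single column with some other separator, only the `translations = ...` line would need to change.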

So this is some cool data. This post talks about what we've done with it so far, and about other ideas for using it.