In several earlier entries, we described the various English paraphrase data that we've collected. At long last, that data has been gathered into tab-separated files, where each row carries the following information:
- Corpus name;
- Document ID;
- Segment ID;
- Source language;
- Genre;
- Whether each segment is a headline or sentence;
- Number of English translations;
- Translations.
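
For anyone who wants to poke at files like these, here is a minimal sketch of reading one in Python. It rests on several assumptions that the description above does not confirm: that the columns appear in the order listed, that there is no header row, and that the translations occupy the trailing columns of each row. The `Segment` class, the `read_segments` function, and the sample row are all made up for illustration.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterator, List, TextIO


@dataclass
class Segment:
    corpus: str
    doc_id: str
    seg_id: str
    source_lang: str
    genre: str
    seg_type: str            # e.g. "headline" or "sentence" (assumed labels)
    translations: List[str]


def read_segments(handle: TextIO) -> Iterator[Segment]:
    """Yield one Segment per row of a tab-separated file.

    Assumes seven fixed columns followed by the translations, with no
    header row. Adjust the indices if the real layout differs.
    """
    # QUOTE_NONE keeps quotation marks inside translations as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 7:
            continue  # skip blank or malformed lines
        corpus, doc_id, seg_id, lang, genre, seg_type, count = row[:7]
        # Trust the declared count rather than the number of columns present.
        translations = row[7:7 + int(count)]
        yield Segment(corpus, doc_id, seg_id, lang, genre, seg_type, translations)


# A made-up row, purely to show the shape of the data.
sample = "corpusA\tdoc1\t1\tzh\tnewswire\theadline\t2\tFirst version\tSecond version\n"
segments = list(read_segments(io.StringIO(sample)))
print(segments[0].translations)
```

Reading with the `csv` module rather than a bare `split("\t")` makes it easy to switch quoting behavior later if the files turn out to escape tabs or quotes differently.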
So this is some cool data. This post covers what we've done with it so far, along with some ideas for what else could be done with it.