Tuesday, September 25, 2012

Multiple English translations: amount of data

Below the fold, a table of the amount of data in the datasets listed in my previous entry. I kind of wish Blogger would really let me put in a table.
I put together a little SGM-to-raw-text script that just pulls the text out of the LDC data files. The following numbers are for all the references combined; divide by 4 to estimate the figures for a single reference.

  • Name - Lines - Words
  • LDC2010T10 Chinese - 3512 - 94774
  • LDC2010T10 Arabic - 2912 - 79580
  • LDC2010T11 Chinese - 3676 - 101129
  • LDC2010T11 Arabic - 2652 - 70012
  • LDC2010T12 Chinese - 6388 - 185587
  • LDC2010T12 Arabic - 4300 - 126906
  • LDC2010T14 Chinese - 4328 - 123474
  • LDC2010T14 Arabic - 4224 - 128111
  • LDC2010T17 (GALE part) Arabic - 1762 - 42574
  • LDC2010T17 (NIST part) Arabic - 6656 - 178877
  • LDC2010T17 (GALE part) Chinese - 2276 - 47987
  • LDC2010T21 (NIST part) Chinese - 6656 - 167380
  • LDC2010T23 Arabic - 5252 - 167779
  • LDC2010T23 Urdu - 7168 - 141638
  • LDC2002T01 (MTC) - 16881 - 467690
  • LDC2003T18 (MTA) - not available on COE
  • LDC2005T05 (MTA) - 4641 - 131209
  • Regina's data - 75106 - 682978
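
The SGM-to-raw-text step mentioned above is not shown in the post, so here is a minimal sketch of how such a script might look. It is not the actual script. It assumes each reference sentence sits inside a `<seg ...>...</seg>` element and that "words" means whitespace-separated tokens. The function names and the sample document are hypothetical.

```python
import html
import re

# Assumption: every translated sentence is wrapped in <seg ...>...</seg>.
SEG_RE = re.compile(r"<seg\b[^>]*>(.*?)</seg>", re.IGNORECASE | re.DOTALL)
# Any markup left inside a segment is dropped.
TAG_RE = re.compile(r"<[^>]+>")


def sgm_to_lines(sgm_text):
    """Return one plain-text line per <seg> element (hypothetical helper)."""
    lines = []
    for match in SEG_RE.finditer(sgm_text):
        body = TAG_RE.sub(" ", match.group(1))  # strip inline tags first
        body = html.unescape(body)              # then decode entities such as &amp;
        lines.append(" ".join(body.split()))    # collapse runs of whitespace
    return lines


def count_lines_and_words(lines):
    """Count lines and whitespace-separated tokens (assumed definition of 'words')."""
    return len(lines), sum(len(line.split()) for line in lines)


# Made-up sample for illustration only; not taken from any LDC release.
SAMPLE = """<refset setid="demo" srclang="Chinese" trglang="English">
<doc docid="d1" sysid="ref1">
<seg id="1"> The talks resumed on Monday . </seg>
<seg id="2"> Prices rose &amp; fell . </seg>
</doc>
</refset>"""

if __name__ == "__main__":
    n_lines, n_words = count_lines_and_words(sgm_to_lines(SAMPLE))
    print(n_lines, n_words)  # prints "2 11"
```

Running something like this over every reference file in a release and summing the results would give combined counts of the kind listed above. By the post's divide-by-4 rule, the first row's 3512 lines come to 878 lines per reference.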

2 comments:

  1. Could you also add statistics about the portion of the MSR paraphrase corpus where the judges decided the sentence pairs were paraphrases?

    Also, it's worth noting that Trevor Cohn, Mirella Lapata and I created a manually word-aligned corpus with 900 sentence pairs. It has 300 sentences from each of the LDC multiple translation corpora (Chinese source), the MSR paraphrase corpus, and Regina's multiple translations of classic French novels.

    Here's that data: http://staffwww.dcs.shef.ac.uk/people/T.Cohn/paraphrase_corpus.html

    Here's our Computational Linguistics article about it:
    http://cs.jhu.edu/~ccb/publications/constructing-corpora-for-paraphrase-systems.pdf

  2. I have been wanting to outsource my marketing to other countries, but I know that there is a language barrier. I am seeking a document translation agency to help seize this opportunity. Would you happen to know what the price range would be for something like this?
