Thursday, September 13, 2012

Data with multiple English translations

I'm hunting down all the datasets I can find with multiple English reference translations of some foreign data. We're making the assumption that multiple translations of the same data are going to be paraphrases of each other.
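To make that assumption concrete, here is a minimal sketch of how paraphrase pairs could be pulled out of a multi-reference set. It assumes the reference translations have already been loaded as lists of segments aligned by index; the function name `paraphrase_pairs` and the toy sentences are hypothetical, and none of this reflects the actual file formats of the corpora listed below.

```python
from itertools import combinations


def paraphrase_pairs(references):
    """Pair up translations of the same source segment.

    `references` is a list of translations; each translation is a list of
    segments, and segment i in every translation is assumed to render the
    same source segment i. (Hypothetical helper, for illustration only.)
    """
    pairs = []
    for segment_versions in zip(*references):
        # Drop empty strings and exact duplicates, keeping first-seen order.
        unique = list(dict.fromkeys(s.strip() for s in segment_versions if s.strip()))
        # Every unordered pair of distinct versions is a candidate paraphrase.
        pairs.extend(combinations(unique, 2))
    return pairs


# Toy data: three "translators", two segments each.
refs = [
    ["The cat sat.", "He left."],
    ["A cat was sitting.", "He left."],
    ["The cat sat down.", "He departed."],
]
pairs = paraphrase_pairs(refs)
```

With four distinct references per segment, as in most of the sets below, this yields up to six candidate pairs per segment.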

Chris suggested that I look for data used in the NIST OpenMT program. For this kind of stuff, the LDC is your best friend. First, I just did ctrl+F for "openmt". There's quite a bit of this data available if you're willing to pay for it:

  • LDC2010T10 -- Chinese and Arabic newswire, 4 translations each
  • LDC2010T11 -- Chinese and Arabic newswire, 4 translations each
  • LDC2010T12 -- 150 each Chinese and Arabic newswire, and 29 Chinese "prepared speech", 4 translations each
  • LDC2010T14 -- 100 each Chinese and Arabic newswire, 4 translations each
  • LDC2010T17 -- 357 newswire documents, 4 translations each
  • LDC2010T21 -- 494 documents, Arabic, Chinese, English and Urdu, in four domains: newswire, broadcast, weblog, and newsgroup. Each document has 4 translations. I don't know whether the English documents referred to are English source documents (in which case, what language are they translated into?) or just the English translations of the other documents.
  • LDC2010T23 -- 373 documents, Arabic and Urdu newswire, broadcast and weblog. 4 translations each.

Also available are the Multiple-Translation corpora, created "to support the development of automatic means for evaluating translation quality." These corpora actually have some overlap with the sets listed above. According to the descriptions, the LDC solicited ten translations of Chinese data, and 11 translations of Arabic data. The Multiple-Translation Arabic (MTA) corpus is available as LDC2003T18 and LDC2005T05. The MTC (Chinese) corpus is LDC2002T01. Both appear to consist of newswire articles.

Searching for "gale", on the other hand, gives us only the training data for machine translation systems, as well as the MT output from the GALE evaluations and their human assessments. Those are not useful for us.

Another place to look is data released for various IWSLT tasks. I looked at the MT track for IWSLT2011 and 2012 (actually a very interesting task -- translating TED talks) but they only have 1 reference translation in English for the development data. (On a side note, the data is quite interesting and possibly worth looking into if we want to work on a different domain once in a while -- see the Web Inventory of Transcribed and Translated Talks.)

One other potential source is multiple translations of things like novels. Barzilay and McKeown (2001) put together a corpus of translations of 5 books (examples: Madame Bovary, Twenty Thousand Leagues Under the Sea) which, according to a footnote, should be available at http://www.cs.columbia.edu/~regina/par. Sadly, sometime in the last ten years that link died. Heck, Regina isn't even at Columbia anymore. I'm still looking around for that corpus, but I'm not spending too much effort on it, because that data seems quite out-of-domain compared to the stuff that we're working on.

Barzilay, Regina and Kathleen R. McKeown. Extracting Paraphrases from a Parallel Corpus. 2001. In Proc. of ACL.

4 comments:

  1. Thanks, Jonny. Here is information on the GALE evaluations (which were conducted by NIST): http://www.itl.nist.gov/iad/mig/tests/gale/

    You should click through and read the evaluation plans for each year to determine whether GALE had multiple translations or not. Since GALE employed HTER they may have decided that only a single reference was necessary.

  2. Through my awesome URL transformation skills I found Regina's old data sets: http://people.csail.mit.edu/regina/par/

  3. Could you also look up the Microsoft Paraphrase Corpus?

  4. The overview papers for the 2004 and 2005 IWSLT campaigns describe 16 references: http://www.is.cs.cmu.edu/iwslt2005/proceedings/CMU_01.pdf
    http://mt-archive.info/IWSLT-2004-Akiba.pdf
    It would be worth tracking those down too.
