WE HAVE RELEASED 4 DATASETS

CLC FCE Dataset

The CLC FCE Dataset is a set of 1,244 exam scripts written by candidates sitting the Cambridge ESOL First Certificate in English (FCE) examination in 2000 and 2001.

The scripts are extracted from the Cambridge Learner Corpus (CLC), developed as a collaborative effort between Cambridge University Press and Cambridge Assessment.

For each exam script, the CLC FCE Dataset includes the original text written by the candidate (transcribed and anonymised, but otherwise unmodified) as well as marks, error annotation and essential demographic details including the candidate’s first language and age bracket.

Licence

The Dataset is released for non-commercial research and educational purposes under the following licence agreement:

  1. By downloading this dataset and licence, this licence agreement is entered into, effective this date, between you, the Licensee, and the University of Cambridge, the Licensor.
  2. Copyright of the entire licensed dataset is held by the Licensor. No ownership or interest in the dataset is transferred to the Licensee.
  3. The Licensor hereby grants the Licensee a non-exclusive non-transferable right to use the licensed dataset for non-commercial research and educational purposes.
  4. Non-commercial purposes exclude without limitation any use of the licensed dataset or information derived from the dataset for or as part of a product or service which is sold, offered for sale, licensed, leased or rented.
  5. The Licensee shall acknowledge use of the licensed dataset in all publications of research based on it, in whole or in part, through citation of the following publication: Yannakoudakis, Helen and Briscoe, Ted and Medlock, Ben, ‘A New Dataset and Method for Automatically Grading ESOL Texts’, Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies.
  6. The Licensee may publish excerpts of less than 100 words from the licensed dataset pursuant to clause 3.
  7. The Licensor grants the Licensee this right to use the licensed dataset ‘as is’. Licensor does not make, and expressly disclaims, any express or implied warranties, representations or endorsements of any kind whatsoever.
  8. This Agreement shall be governed by and construed in accordance with the laws of England and the English courts shall have exclusive jurisdiction.

Download

You may download the CLC FCE Dataset if you agree to the licence above. Publications using it must include a reference to the following publication:

  • Yannakoudakis, Helen, Ted Briscoe and Ben Medlock: ‘A New Dataset and Method for Automatically Grading ESOL Texts’, Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies.
DOWNLOAD

Dataset for error detection

A derived version of the dataset intended for error detection is available under the same licence.

DOWNLOAD

Error-annotated AN Combinations (COLING 2014)

We have released a dataset of adjective–noun (AN) combinations which, on the one hand, exemplify the typical errors committed by language learners in the choice of content words within such combinations, and, on the other hand, are challenging for an error detection and correction (EDC) system.

Starting from an analysis of the typical errors in AN combinations committed by language learners, we compiled a list of 61 adjectives that often present problems for learners. AN combinations involving those adjectives were then extracted from the unannotated part of the Cambridge Learner Corpus. We have focused on AN combinations previously unseen in the British National Corpus.

This set contains 798 AN combinations. Each combination is marked up as correct or incorrect, and, for the incorrect combinations, the locus of the error is identified (adjective, noun or both), as is the type of confusion involved (incorrect synonym, form-related word, or non-related word). The most appropriate corrections are included in the dataset.
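As a rough illustration of the annotation layers described above, one record could be modelled as follows. This is a minimal sketch, not the dataset's actual file format: the class name, field names and label strings are all hypothetical, and a single confusion type per record is a simplifying assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical label vocabularies; the released files may use other codes.
LOCI = {"adjective", "noun", "both"}
CONFUSION_TYPES = {"synonym", "form-related", "non-related"}


@dataclass
class ANAnnotation:
    """One adjective-noun combination with its (assumed) annotation layers."""
    adjective: str
    noun: str
    is_correct: bool
    error_locus: Optional[str] = None      # only for incorrect combinations
    confusion_type: Optional[str] = None   # only for incorrect combinations
    corrections: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.is_correct:
            # Correct combinations carry no error information.
            if self.error_locus is not None or self.confusion_type is not None:
                raise ValueError("correct combinations have no error details")
        else:
            if self.error_locus not in LOCI:
                raise ValueError(f"unknown error locus: {self.error_locus!r}")
            if self.confusion_type not in CONFUSION_TYPES:
                raise ValueError(f"unknown confusion type: {self.confusion_type!r}")


# Invented example, not taken from the dataset.
example = ANAnnotation(
    adjective="big",
    noun="importance",
    is_correct=False,
    error_locus="adjective",
    confusion_type="synonym",
    corrections=["great importance"],
)
```

Validating in `__post_init__` keeps the two cases apart: a correct combination cannot carry error details, and an incorrect one must name both a locus and a confusion type.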

Further details can be found in the document Error annotation in adjective–noun (AN) combinations.

Licence

The Dataset is released for non-commercial research and educational purposes under the following licence agreement:

  1. By downloading this dataset and licence, this licence agreement is entered into, effective this date, between you, the Licensee, and the University of Cambridge, the Licensor.
  2. Copyright of the entire licensed dataset is held by the Licensor. No ownership or interest in the dataset is transferred to the Licensee.
  3. The Licensor hereby grants the Licensee a non-exclusive non-transferable right to use the licensed dataset for non-commercial research and educational purposes.
  4. Non-commercial purposes exclude without limitation any use of the licensed dataset or information derived from the dataset for or as part of a product or service which is sold, offered for sale, licensed, leased or rented.
  5. The Licensee shall acknowledge use of the licensed dataset in all publications of research based on it, in whole or in part, through citation of the relevant publication(s) mentioned next to the download link.
  6. The Licensee may publish excerpts of less than 100 words from the licensed dataset pursuant to clause 3.
  7. The Licensor grants the Licensee this right to use the licensed dataset ‘as is’. Licensor does not make, and expressly disclaims, any express or implied warranties, representations or endorsements of any kind whatsoever.
  8. This Agreement shall be governed by and construed in accordance with the laws of England and the English courts shall have exclusive jurisdiction.

Download

You may download the dataset if you agree to the licence above. Publications using it should include a reference to the following publication:

  • Kochmar, Ekaterina and Ted Briscoe: ‘Detecting learner errors in the choice of content words using compositional distributional semantics’, Proceedings of the 25th International Conference on Computational Linguistics (COLING 2014).
DOWNLOAD

CLC-FCE AN Combinations (RANLP 2013)

This dataset of adjective–noun (AN) combinations is extracted from the parsed version of the publicly available CLC FCE Dataset (the first dataset described above). The error coding is used to divide the set into two subsets: correctly used ANs, and ANs annotated as errors due to an inappropriate choice of adjective and/or noun. For ANs that are used correctly in some contexts and incorrectly in others, the most frequent annotation from the CLC is used. The dataset contains 4,681 correct and 530 incorrect combinations.
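The "most frequent annotation" rule for ANs with mixed usage can be sketched as a majority vote over annotated occurrences. This is an illustration only: the function name, the tuple layout and the label strings are assumptions, and the description above does not say how ties were resolved (here the label seen first wins).

```python
from collections import Counter, defaultdict


def majority_labels(occurrences):
    """Map each (adjective, noun) pair to its most frequent annotation.

    `occurrences` is an iterable of (adjective, noun, label) tuples.
    Ties go to the label encountered first, which is an assumption.
    """
    counts = defaultdict(Counter)
    for adjective, noun, label in occurrences:
        counts[(adjective, noun)][label] += 1
    return {pair: c.most_common(1)[0][0] for pair, c in counts.items()}


# Invented occurrences, not taken from the corpus.
occurrences = [
    ("big", "problem", "correct"),
    ("big", "problem", "correct"),
    ("big", "problem", "incorrect"),
    ("strong", "rain", "incorrect"),
]
labels = majority_labels(occurrences)
```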

The set of ANs is further divided into corpus-attested and corpus-unattested examples, using a parsed version of the British National Corpus (BNC) as the reference, with the frequency threshold set to 3 occurrences in the corpus.
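The attested/unattested split could look roughly like the sketch below. The function and parameter names are hypothetical, and treating a pair as attested when it occurs at least 3 times (rather than more than 3 times) is an assumption, since the description does not say which side of the threshold is inclusive.

```python
def split_by_attestation(an_pairs, reference_counts, threshold=3):
    """Split AN pairs into corpus-attested and corpus-unattested lists.

    `reference_counts` maps (adjective, noun) lemma pairs to their frequency
    in the reference corpus. A pair counts as attested when its frequency is
    at least `threshold` (assumed inclusive).
    """
    attested, unattested = [], []
    for pair in an_pairs:
        if reference_counts.get(pair, 0) >= threshold:
            attested.append(pair)
        else:
            unattested.append(pair)
    return attested, unattested


# Invented counts for illustration; not real BNC frequencies.
counts = {("heavy", "rain"): 12, ("strong", "rain"): 2}
pairs = [("heavy", "rain"), ("strong", "rain"), ("big", "importance")]
attested, unattested = split_by_attestation(pairs, counts)
```

Pairs missing from the reference counts are treated as having frequency 0, so they fall into the unattested list.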

Both the CLC FCE Dataset and the BNC are lemmatised, tagged and parsed using the RASP system (Briscoe et al., 2006; Andersen et al., 2008).

Licence

The Dataset is released for non-commercial research and educational purposes under the following licence agreement:

  1. By downloading this dataset and licence, this licence agreement is entered into, effective this date, between you, the Licensee, and the University of Cambridge, the Licensor.
  2. Copyright of the entire licensed dataset is held by the Licensor. No ownership or interest in the dataset is transferred to the Licensee.
  3. The Licensor hereby grants the Licensee a non-exclusive non-transferable right to use the licensed dataset for non-commercial research and educational purposes.
  4. Non-commercial purposes exclude without limitation any use of the licensed dataset or information derived from the dataset for or as part of a product or service which is sold, offered for sale, licensed, leased or rented.
  5. The Licensee shall acknowledge use of the licensed dataset in all publications of research based on it, in whole or in part, through citation of the relevant publication(s) mentioned next to the download link.
  6. The Licensee may publish excerpts of less than 100 words from the licensed dataset pursuant to clause 3.
  7. The Licensor grants the Licensee this right to use the licensed dataset ‘as is’. Licensor does not make, and expressly disclaims, any express or implied warranties, representations or endorsements of any kind whatsoever.
  8. This Agreement shall be governed by and construed in accordance with the laws of England and the English courts shall have exclusive jurisdiction.

Download

You may download the adjective–noun Dataset if you agree to the licence above. Publications using it should include a reference to the following publication:

  • Yannakoudakis, Helen, Ted Briscoe and Ben Medlock: ‘A New Dataset and Method for Automatically Grading ESOL Texts’, Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies.
DOWNLOAD

Cambridge English Readability Dataset

The Cambridge English Readability Dataset is composed of reading passages from the five main-suite Cambridge English Exams (KET, PET, FCE, CAE, CPE). These five exams are targeted at learners at A2–C2 levels of the Common European Framework of Reference (CEFR). The documents are harvested from all the tasks in past reading papers for each of the exams. Because the Cambridge English Exams are designed specifically for L2 learners, the A2–C2 level assigned to each reading paper can be treated as the level of reading difficulty of its documents for L2 learners.
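For readers who want to turn exam names into difficulty labels, a minimal sketch is given below. It assumes the usual one-to-one alignment of the five exams with the five CEFR levels, in the order listed above; the dictionary and function names are hypothetical and not part of the dataset.

```python
# Assumed one-to-one alignment of exams with CEFR levels.
EXAM_TO_CEFR = {
    "KET": "A2",
    "PET": "B1",
    "FCE": "B2",
    "CAE": "C1",
    "CPE": "C2",
}

# CEFR levels covered by the dataset, from easiest to hardest.
CEFR_ORDER = ["A2", "B1", "B2", "C1", "C2"]


def difficulty_rank(exam: str) -> int:
    """Return an ordinal difficulty (0 = easiest) for an exam name."""
    return CEFR_ORDER.index(EXAM_TO_CEFR[exam])


# Sort exam names from easiest to hardest reading level.
ordered = sorted(["CPE", "KET", "FCE"], key=difficulty_rank)
```

An ordinal rank of this kind is convenient when readability is treated as a ranking or regression target rather than as five unordered classes.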

Licence

The Dataset is released for non-commercial research and educational purposes under the following licence agreement:

  1. By downloading this dataset and licence, this licence agreement is entered into, effective this date, between you, the Licensee, and the University of Cambridge, the Licensor.
  2. Copyright of the entire licensed dataset is held by the Licensor. No ownership or interest in the dataset is transferred to the Licensee.
  3. The Licensor hereby grants the Licensee a non-exclusive non-transferable right to use the licensed dataset for non-commercial research and educational purposes.
  4. Non-commercial purposes exclude without limitation any use of the licensed dataset or information derived from the dataset for or as part of a product or service which is sold, offered for sale, licensed, leased or rented.
  5. The Licensee shall acknowledge use of the licensed dataset in all publications of research based on it, in whole or in part, through citation of the following publication: Menglin Xia, Ekaterina Kochmar and Ted Briscoe (2016). Text Readability Assessment for Second Language Learners. Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications.
  6. The Licensee may publish excerpts of less than 100 words from the licensed dataset pursuant to clause 3.
  7. The Licensor grants the Licensee this right to use the licensed dataset ‘as is’. Licensor does not make, and expressly disclaims, any express or implied warranties, representations or endorsements of any kind whatsoever.
  8. This Agreement shall be governed by and construed in accordance with the laws of England and the English courts shall have exclusive jurisdiction.

Download

You may download the Cambridge English Readability Dataset if you agree to the licence above.

Publications using it must include a reference to the following publication:

  • Menglin Xia, Ekaterina Kochmar and Ted Briscoe (2016). Text Readability Assessment for Second Language Learners. Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications.
DOWNLOAD