The ARRAU Corpus is a corpus annotated for anaphoric information that focuses in particular on the ‘difficult’ cases of anaphora: plural anaphora, anaphora to abstract objects, and ambiguous anaphoric expressions.
See the Publications page.
Two coding manuals were written: one for the spoken dialogue data and one for the text data.
The ARRAU Corpus is available as follows:
- from the LDC (here), except for the GNOME domain data, which are distributed directly by the authors (contact: firstname.lastname@example.org);
- from the authors, to any requester who can show they have purchased the Penn Treebank and TRAINS-93 from the LDC (contact: email@example.com);