Westbury Lab Usenet Corpus: anonymized compilation of postings from 47,860 English-language newsgroups from 2005-2010 (40 GB)
Westbury Lab Wikipedia Corpus: snapshot of all the articles in the English part of Wikipedia, taken in April 2010. It was processed, as described in detail below, to remove all links and irrelevant material (navigation text, etc.). The corpus is untagged, raw text. Used by Stanford NLP (1.8 GB).
A corpus of manually constructed explanation graphs, explanatory role ratings, and an associated semistructured tablestore for most publicly available elementary science exam questions in the US (8 MB)