My Blog

WorldTree Corpus of Explanation Graphs for Elementary Science Questions

by wpadmin on September 29, 2020


Westbury Lab Usenet Corpus: anonymized compilation of postings from 47,860 English-language newsgroups from 2005-2010 (40 GB)

Westbury Lab Wikipedia Corpus: snapshot of all articles in the English part of Wikipedia, taken in April 2010. It was processed, as described in detail below, to remove all links and irrelevant material (navigation text, etc.). The corpus is untagged, raw text. Used by Stanford NLP (1.8 GB).

WorldTree Corpus of Explanation Graphs for Elementary Science Questions: a corpus of manually-constructed explanation graphs, explanatory role ratings, and associated semistructured tablestore for most publicly available elementary science exam questions in the US (8 MB)

Wikipedia Extraction (WEX): a processed dump of English-language Wikipedia (66 GB)

Wikipedia XML data: complete copy of all Wikimedia wikis, in the form of wikitext source and metadata embedded in XML. (500 GB)
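
Since the Wikimedia dumps are far too large to load whole, they are normally processed as a stream. Below is a minimal sketch using Python's standard-library `xml.etree.ElementTree.iterparse`; the tiny inline sample stands in for a real multi-gigabyte dump file, and the namespace URI is the one used by the `export-0.10` schema (it varies by dump version).

```python
# Minimal sketch: streaming over <page> elements in a MediaWiki XML export.
# The SAMPLE string is a toy stand-in for a real dump file handle.
import io
import xml.etree.ElementTree as ET

SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <revision><text>Some wikitext here.</text></revision>
  </page>
</mediawiki>"""

# Clark-notation namespace prefix; adjust to match the dump's schema version.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

def iter_pages(source):
    """Yield (title, wikitext) pairs without loading the whole dump into memory."""
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == NS + "page":
            title = elem.findtext(NS + "title")
            text = elem.findtext(f"{NS}revision/{NS}text")
            yield title, text
            elem.clear()  # free memory held by the already-processed page

pages = list(iter_pages(io.StringIO(SAMPLE)))
print(pages)  # [('Example', 'Some wikitext here.')]
```

For a real dump, pass an open (optionally decompressing) file object instead of the `StringIO` sample; `iterparse` never builds the full tree, so memory stays flat.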

Yahoo! Answers Comprehensive Questions and Answers: Yahoo! Answers corpus as of 10/25/2007. Contains 4,483,032 questions and their answers. (3.6 GB)

Yahoo! Answers consisting of questions asked in French: subset of the Yahoo! Answers corpus from 2006 to 2015, consisting of 1.7 million questions posed in French and their corresponding answers. (3.8 GB)

Yahoo! Answers Manner Questions: subset of the Yahoo! Answers corpus from a 10/25/2007 dump, selected for their linguistic properties. Contains 142,627 questions and their answers. (104 MB)

Yahoo! HTML Forms Extracted from Publicly Available Webpages: contains a small sample of pages with complex HTML forms; 2.67 million complex forms in total. (50+ GB)

Yahoo N-Gram Representations: This dataset contains n-gram representations. The data may serve as a testbed for the query rewriting task, a common problem in IR research, as well as for word and phrase similarity tasks, which are common in NLP research. (2.6 GB)

Yahoo! N-Grams, version 2.0: n-grams (n = 1 to 5), extracted from a corpus of 14.6 million documents (126 million unique sentences, 3.4 billion running words) crawled from over 12,000 news-oriented web sites (12 GB)
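
The unit distributed in corpora like this is the counted word n-gram. A minimal sketch of how such counts are produced (illustrative only; it is not the pipeline Yahoo! used) is:

```python
# Minimal sketch: counting all word n-grams of length 1..5 in a corpus,
# the kind of statistic shipped in n-gram corpora such as Yahoo! N-Grams 2.0.
from collections import Counter

def ngram_counts(sentences, max_n=5):
    """Count all whitespace-token n-grams up to length max_n across sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts

counts = ngram_counts(["the cat sat", "the cat ran"])
print(counts[("the", "cat")])  # 2
```

Real n-gram corpora add tokenization rules, frequency cutoffs, and sharded on-disk storage, but the underlying counts are exactly this.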

Yahoo! Search Logs with Relevance Judgments: anonymized Yahoo! search logs with relevance judgments (1.3 GB)

Yahoo! Semantically Annotated Snapshot of the English Wikipedia: English Wikipedia dated 2006-11-04, processed with a number of publicly available NLP tools. 1,490,688 entries. (6 GB)

Yelp: includes restaurant rankings and 2.2M reviews (on demand)

Youtube: descriptions of 1.7 million YouTube videos (torrent)

  • Awesome public datasets / NLP (includes more lists)
  • AWS Public Datasets
  • CrowdFlower: Data for Everyone (lots of small surveys they conducted and data obtained by crowdsourcing for a specific task)
  • Kaggle 1, 2 (make sure, though, that the Kaggle competition data can be used outside of the competition!)
  • Open Library
  • Quora (primarily annotated corpora)
  • /r/datasets (endless list of datasets, though most are scraped by amateurs and not properly documented or licensed)
  • (another big list)
  • Stackexchange: Opendata
  • Stanford NLP group (primarily annotated corpora and TreeBanks, or actual NLP tools)
  • Yahoo! Webscope (also includes papers that use the provided data)
  • SaudiNewsNet: 31,030 Arabic newspaper articles along with metadata, extracted from various online Saudi newspapers. (2 MB)
  • Collection of Urdu Datasets for POS, NER and NLP tasks.

German Political Speeches Corpus: collection of recent speeches held by top German representatives (25 MB, 11 MTokens)

NEGRA: A Syntactically Annotated Corpus of German Newspaper Texts. Available for free to all universities and non-profit organizations. You need to sign and send a form to obtain it. (on demand)

Ten Thousand German News Articles Dataset: 10,273 German-language news articles categorized into nine classes for topic classification. (26.1 MB)

100k German Court Decisions: Open Legal Data releases a dataset of 100,000 German court decisions and 444,000 citations (772 MB)


