Yahoo! Semantically Annotated Snapshot of the English Wikipedia, version 1.0

This SW1 dataset contains a snapshot of the English Wikipedia dated 2006-11-04, processed with a number of publicly available NLP tools. To build SW1, we started from the XML-ized Wikipedia dump distributed by the University of Amsterdam. This snapshot of the English Wikipedia contains 1,490,688 entries (excluding redirects). First, the text is extracted from each XML entry and split into sentences using simple heuristics. Then several syntactic and semantic NLP taggers were run on the text and their output collected.

Raw Data (Multitag format)

The multitag format contains all of the Wikipedia text plus all of the semantic tags; all other data files can be reconstructed from it. A multitag file contains several Wikipedia entries. The Wikipedia snapshot was split into 3,000 multitag files, each containing roughly 500 entries.
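The exact sentence-splitting heuristics used for SW1 are not specified here; a minimal sketch of the kind of rule-based splitter the description suggests (split on sentence-final punctuation followed by whitespace and a capital letter) might look like this. The function name and regex are illustrative assumptions, not the actual SW1 code.

```python
import re

def split_sentences(text):
    """Naive rule-based sentence splitter (illustrative only).

    Splits on '.', '!', or '?' when followed by whitespace and an
    uppercase letter. Real pipelines need extra rules for
    abbreviations, quotations, etc.
    """
    parts = re.split(r'(?<=[.!?])\s+(?=[A-Z])', text.strip())
    return [p for p in parts if p]
```

A real splitter would also have to handle abbreviations ("e.g.", "U.S.") and markup left over from the XML extraction, which is why the README calls the heuristics "simple".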

Available through the Yahoo! Webscope program: https://webscope.sandbox.yahoo.com/catalog.php?datatype=l

Publications

  • Jordi Atserias, Hugo Zaragoza, Massimiliano Ciaramita and Giuseppe Attardi, "Semantically Annotated Snapshot of the English Wikipedia", in Proceedings of LREC 2008.