Araneum Anglicum
English Web Corpus
Crawled in November 2013 by SpiderLing 0.72
No top-level domain restriction
Language similarity threshold set to 0.5
Tagged by Tree Tagger using the English parameter file based on the Penn Treebank Tagset
Native tagset mapped to Araneum Universal Tagset
Paragraph-level deduplicated by Onion, tokens in duplicate paragraphs marked
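To illustrate what marking (rather than deleting) duplicate paragraphs means for the token and unmarked-word counts, here is a minimal Python sketch. It is not Onion's algorithm, which is not described here and may differ. It assumes exact matching on whitespace- and case-normalized paragraph text and whitespace tokenization, and the function names are hypothetical.

```python
import hashlib


def mark_duplicate_paragraphs(paragraphs):
    """Return (paragraph, is_duplicate) pairs.

    Assumption: a paragraph counts as a duplicate only if an identical
    (case- and whitespace-normalized) paragraph was seen earlier.
    The first occurrence stays unmarked.
    """
    seen = set()
    marked = []
    for text in paragraphs:
        # Normalize so trivial variants produce the same hash key.
        normalized = " ".join(text.lower().split())
        key = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        marked.append((text, key in seen))
        seen.add(key)
    return marked


def count_tokens(marked):
    """Count whitespace tokens in all paragraphs and in unmarked ones only."""
    total = sum(len(text.split()) for text, _ in marked)
    unmarked = sum(len(text.split()) for text, dup in marked if not dup)
    return total, unmarked


if __name__ == "__main__":
    sample = ["The cat sat.", "Another paragraph here.", "the  cat sat."]
    marked = mark_duplicate_paragraphs(sample)
    print(count_tokens(marked))
```

Because duplicates are only flagged, the full text stays in the corpus and the unmarked count can be reported alongside the total, as in the version sizes listed below.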
Versions available
Araneum Anglicum Maius: 1,200,005,994 tokens, 888,421,345 unmarked words
Araneum Anglicum Minus: 119,344,583 tokens, 96,226,386 unmarked words
Revision history
14.12 Refiltered, retagged and rededuplicated; even newer sketch grammar
14.09 New sketch grammar
14.04 Initial publicly released version