Araneum Nederlandicum
Dutch Web Corpus
Crawled in November 2013 by
No top-level domain restriction
Language similarity threshold set to 0.5
Tagged by Tree Tagger using the Dutch parameter file based on the Noname Tagset
Native tagset mapped to Araneum Universal Tagset
Paragraph-level deduplicated by
, tokens in duplicate paragraphs marked
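As an illustration of what paragraph-level deduplication with marking could look like, here is a minimal Python sketch. It is not the tool actually used for this corpus: the function names (`normalize`, `mark_duplicate_paragraphs`, `count_unmarked`), the whitespace and case normalization, and the exact-match SHA-1 comparison are all assumptions made for the example. The idea shown is only that later copies of a paragraph are kept in the corpus but their tokens carry a flag, so that counts can be reported with or without them.

```python
import hashlib
from typing import Iterable, List, Tuple


def normalize(paragraph: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(paragraph.lower().split())


def mark_duplicate_paragraphs(paragraphs: Iterable[str]) -> List[Tuple[str, bool]]:
    """Return (token, is_marked) pairs.

    The first occurrence of a paragraph is left unmarked; tokens of every
    later paragraph with the same normalized text are marked as duplicates.
    Tokens are split on whitespace, which assumes pre-tokenized input.
    """
    seen = set()
    marked: List[Tuple[str, bool]] = []
    for paragraph in paragraphs:
        key = hashlib.sha1(normalize(paragraph).encode("utf-8")).hexdigest()
        is_duplicate = key in seen
        seen.add(key)
        for token in paragraph.split():
            marked.append((token, is_duplicate))
    return marked


def count_unmarked(marked: List[Tuple[str, bool]]) -> int:
    """Count tokens that are not inside a duplicate paragraph."""
    return sum(1 for _, is_duplicate in marked if not is_duplicate)


if __name__ == "__main__":
    sample = ["Dit is een zin .", "Nog een zin .", "dit is  een zin ."]
    tokens = mark_duplicate_paragraphs(sample)
    print(len(tokens), "tokens,", count_unmarked(tokens), "unmarked")
```

Marking rather than deleting keeps the original text intact, so the same corpus can be queried either way.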
Versions available
Araneum Nederlandicum Maius: 1,200,000,837 tokens, 713,417,518 unmarked words
Araneum Nederlandicum Minus: 120,732,571 tokens, 90,887,838 unmarked words
Revision history
14.04 Initial publicly released version