Araneum Finnicum
Finnish Web Corpus
Crawled from June to August 2014 by SpiderLing 0.74
No top-level domain restriction
Language similarity threshold set to 0.5
Tagged by Tree Tagger using the Finnish parameter file based on the Noname Tagset
Native tagset mapped to Araneum Universal Tagset
Deduplicated at the paragraph level by Onion; tokens in duplicate paragraphs are marked
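The idea of marking duplicate paragraphs rather than deleting them can be illustrated with a minimal sketch. This is not Onion's method, which is not described here; the sketch assumes exact matching after whitespace and case normalization, and the function name mark_duplicate_paragraphs is hypothetical.

```python
import hashlib


def mark_duplicate_paragraphs(paragraphs):
    """Return (paragraph, is_duplicate) pairs.

    Assumption: a paragraph counts as a duplicate only if an earlier
    paragraph is identical after whitespace and case normalization.
    """
    seen = set()
    marked = []
    for text in paragraphs:
        # Collapse whitespace and lowercase so trivial variants collide.
        normalized = " ".join(text.split()).lower()
        key = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        # The first occurrence stays unmarked; later repeats are flagged.
        marked.append((text, key in seen))
        seen.add(key)
    return marked
```

Keeping the flagged paragraphs in place, instead of dropping them, leaves each document intact while still letting duplicate tokens be excluded from counts.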
Versions available
Araneum Finnicum Maius: 1,200,000,486 tokens, 817,453,523 unmarked words
Araneum Finnicum Minus: 119,444,951 tokens, 91,859,486 unmarked words
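As a rough reading aid, the two figures given for each version can be related by a simple ratio. The sketch below assumes only the numbers listed above; the names VERSIONS and unmarked_share are hypothetical, and since the listing does not define how "words" differ from "tokens", the ratio should not be read as an exact duplicate rate.

```python
# (tokens, unmarked words) as listed for each version
VERSIONS = {
    "Maius": (1_200_000_486, 817_453_523),
    "Minus": (119_444_951, 91_859_486),
}


def unmarked_share(tokens, unmarked_words):
    """Fraction of the token count accounted for by unmarked words."""
    return unmarked_words / tokens


for name, (tokens, unmarked) in VERSIONS.items():
    print(f"{name}: {unmarked_share(tokens, unmarked):.1%}")
```

With the listed figures, the share is roughly 68% for Maius and 77% for Minus.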
Revision history
14.08 Initial publicly released version