Araneum Polonicum
Polish Web Corpus
Crawled in October 2013 by SpiderLing 0.72
No top-level domain restriction
Language similarity threshold set to 0.5
Sentence-level deduplicated, tokens in duplicate sentences marked
Tagged by Tree Tagger using the Polish parameter file based on the NKJP Tagset
Native tagset mapped to Araneum Universal Tagset
Paragraph-level deduplicated by Onion, tokens in duplicate paragraphs marked
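The idea of marking duplicates rather than deleting them can be illustrated with a minimal sketch. This is not Onion's actual algorithm: it only flags exact repeats after whitespace and case normalization, and the function name `mark_duplicate_paragraphs` is hypothetical.

```python
import hashlib


def mark_duplicate_paragraphs(paragraphs):
    """Return (paragraph, is_duplicate) pairs.

    The first occurrence of a paragraph stays unmarked; later copies are
    flagged instead of removed, so the text remains complete.
    Assumption: exact matching on normalized text, for illustration only.
    """
    seen = set()
    result = []
    for para in paragraphs:
        # Collapse whitespace and lowercase so trivially different copies collide.
        normalized = " ".join(para.split()).lower()
        key = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        result.append((para, key in seen))
        seen.add(key)
    return result


if __name__ == "__main__":
    sample = ["Ala ma kota.", "To jest test.", "ala  ma kota."]
    for text, is_dup in mark_duplicate_paragraphs(sample):
        print("DUP" if is_dup else "   ", text)
```

Keeping the flagged copies in place lets a corpus user decide at query time whether to count or skip the duplicated material.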
Versions available
Araneum Polonicum Maius: 1,200,002,958 tokens, 785,753,805 unmarked words
Araneum Polonicum Minus: 118,847,566 tokens, 89,958,096 unmarked words
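To put the two figures per version in relation, the short sketch below computes the share of unmarked words among all tokens. The helper name `unmarked_share` is hypothetical, and the ratio is only a rough indicator: the gap between the two counts presumably includes non-word tokens such as punctuation as well as tokens marked as duplicates.

```python
# Figures copied from the version list above: (tokens, unmarked words).
VERSIONS = {
    "Maius": (1_200_002_958, 785_753_805),
    "Minus": (118_847_566, 89_958_096),
}


def unmarked_share(tokens, unmarked_words):
    """Fraction of all tokens that are counted as unmarked words."""
    return unmarked_words / tokens


if __name__ == "__main__":
    for name, (tokens, unmarked) in VERSIONS.items():
        print(f"{name}: {unmarked_share(tokens, unmarked):.1%}")
```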
Revision history
15.02 Newly filtered, tokenized & retagged by Tree Tagger
14.04 Initial publicly released version (tagged by Morfeusz & TaKIPI)