Araneum Russicum
Russian Web Corpus
Crawled in April & July 2013 by SpiderLing 0.72
TLD restriction set to include the most relevant national (.ru, .su, .ua, .by, .il) and generic (.com, .edu, .info, .org, .net) TLDs only
Language similarity threshold set to 0.5
Tagged by Tree Tagger using the Russian parameter file based on the MULTEXT-East Russian tagset
Native tagset mapped to Araneum Universal Tagset
Paragraph-level deduplicated by Onion, tokens in duplicate paragraphs marked
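Two of the processing steps listed above, the TLD restriction and the marking of duplicate paragraphs, can be illustrated with a minimal Python sketch. It is not the SpiderLing or Onion implementation: the function names `tld_allowed` and `mark_duplicate_paragraphs` are hypothetical, and the duplicate check is a deliberately simplified stand-in that only catches paragraphs identical after whitespace and case normalization, whereas Onion is designed to detect near-duplicates as well. Only the TLD list itself is taken from the description above.

```python
import hashlib
from urllib.parse import urlsplit

# TLDs named in the corpus description (national and generic).
ALLOWED_TLDS = {"ru", "su", "ua", "by", "il", "com", "edu", "info", "org", "net"}


def tld_allowed(url: str) -> bool:
    """Return True if the URL's host ends in one of the listed TLDs.

    Hypothetical helper; URLs without a scheme have no parsed host
    and are rejected.
    """
    host = urlsplit(url).hostname or ""
    tld = host.rstrip(".").rpartition(".")[2].lower()
    return tld in ALLOWED_TLDS


def mark_duplicate_paragraphs(paragraphs):
    """Pair each paragraph with a flag that is True for repeats.

    Simplified assumption: a paragraph counts as a duplicate only if an
    earlier paragraph is identical after collapsing whitespace and
    lowercasing. Duplicates are marked, not removed, mirroring the
    "tokens in duplicate paragraphs marked" behaviour described above.
    """
    seen = set()
    marked = []
    for text in paragraphs:
        normalized = " ".join(text.split()).lower()
        key = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        marked.append((text, key in seen))
        seen.add(key)
    return marked
```

Marking rather than deleting duplicates keeps the full token stream available, which is consistent with each version reporting both a total token count and a separate count of unmarked words.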
Versions available
Araneum Russicum Maius: 1,200,001,911 tokens, 850,194,623 unmarked words
Araneum Russicum Minus: 120,139,611 tokens, 90,809,716 unmarked words
Revision history
15.02 Reprocessed & re-deduplicated
14.04 First full release
13.10 Initial publicly released Minus version