Araneum Sinicum
Chinese (Simplified) Web Corpus
Crawled in May & June 2013 by SpiderLing 0.72
No top-level domain restriction
Language similarity threshold set to 0.5
Tokenized by unitok
Tagged by TreeTagger using Serge Sharoff's parameter file based on the Jan Hajič tagset
Native tagset mapped to Araneum Universal Tagset
Paragraph-level deduplicated by Onion, tokens in duplicate paragraphs marked
Versions available
Araneum Sinicum Maius: 1,200,001,911 tokens, 850,194,623 unmarked words
Araneum Sinicum Minus: 120,139,611 tokens, 90,809,716 unmarked words
Revision history
15.04 Initial publicly released version