Araneum Italicum
Italian Web Corpus
Crawled in October 2014
No top-level domain restriction
Language similarity threshold set to 0.7
Tagged by TreeTagger using the UTF-8 Italian parameter file based on the Achim Stein Italian Tagset
Native tagset mapped to Araneum Universal Tagset
Deduplicated at the paragraph level; tokens in duplicate paragraphs marked
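As a rough illustration of what paragraph-level deduplication with marking (rather than removal) can look like, here is a minimal Python sketch. It is not the tool actually used for this corpus: the function name `mark_duplicate_paragraphs`, whitespace tokenization, and the case-insensitive, whitespace-normalized hash key are all assumptions made for the example.

```python
import hashlib


def mark_duplicate_paragraphs(paragraphs):
    """Return one list of (token, is_duplicate) pairs per paragraph.

    The first occurrence of a paragraph is left unmarked; every later
    repeat keeps its tokens but flags them as duplicates.
    """
    seen = set()
    result = []
    for para in paragraphs:
        # Assumption: whitespace tokenization stands in for a real tokenizer.
        tokens = para.split()
        # Assumption: paragraphs match if equal after lowercasing and
        # collapsing whitespace.
        normalized = " ".join(tokens).lower()
        key = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        is_duplicate = key in seen
        seen.add(key)
        result.append([(tok, is_duplicate) for tok in tokens])
    return result


if __name__ == "__main__":
    sample = ["Ciao mondo", "Altro testo", "ciao  mondo"]
    for marked in mark_duplicate_paragraphs(sample):
        print(marked)
```

Marking instead of deleting keeps the running text intact while still letting duplicate tokens be excluded from frequency counts.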
Versions available
Araneum Italicum Minus: 120 M tokens
Revision history
14.10 Initial publicly released Minus version