Araneum Hungaricum
Hungarian Web Corpus
Crawled in August 2014 by SpiderLing 0.76
No top-level domain restriction
Language similarity threshold set to 0.5
Tagged by Csaba Oravecz using the latest version of the Hungarian National Corpus morphosyntactic annotation pipeline
Native tagset mapped to the Araneum Universal Tagset
Paragraph-level deduplicated by Onion; tokens in duplicate paragraphs marked
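The marking of duplicate paragraphs can be illustrated with a minimal sketch. This is not Onion's algorithm or interface: it uses exact, case-insensitive paragraph matching as a simplified stand-in, and the function names `mark_duplicate_paragraphs` and `count_tokens` are hypothetical. It only shows how a corpus can keep all tokens while flagging those in repeated paragraphs, so that total and unmarked counts differ.

```python
from typing import List, Set, Tuple

Paragraph = List[str]  # a paragraph as a list of tokens


def mark_duplicate_paragraphs(
    paragraphs: List[Paragraph],
) -> List[Tuple[Paragraph, bool]]:
    """Flag every paragraph whose tokens were already seen earlier.

    Assumption: exact, case-insensitive matching. A real deduplication
    tool may use a different similarity criterion.
    """
    seen: Set[Tuple[str, ...]] = set()
    marked: List[Tuple[Paragraph, bool]] = []
    for tokens in paragraphs:
        key = tuple(token.lower() for token in tokens)
        is_duplicate = key in seen
        seen.add(key)
        # The paragraph is kept in the corpus; only the flag changes.
        marked.append((tokens, is_duplicate))
    return marked


def count_tokens(marked: List[Tuple[Paragraph, bool]]) -> Tuple[int, int]:
    """Return (all tokens, tokens outside duplicate paragraphs)."""
    total = sum(len(tokens) for tokens, _ in marked)
    unmarked = sum(len(tokens) for tokens, dup in marked if not dup)
    return total, unmarked


if __name__ == "__main__":
    sample = [
        ["Jó", "napot", "."],
        ["Ez", "egy", "teszt", "."],
        ["Jó", "napot", "."],  # repeats the first paragraph
    ]
    print(count_tokens(mark_duplicate_paragraphs(sample)))
```

Keeping duplicates in place and marking them, rather than deleting them, leaves the surrounding text intact while still allowing counts and queries to exclude the repeated material.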
Versions available
Araneum Hungaricum Maius: 1,200,001,609 tokens, 792,549,686 unmarked words
Araneum Hungaricum Minus: 119,037,475 tokens, 88,183,876 unmarked words
Revision history
14.12 Initial publicly released version
Reference
Csaba Oravecz, Tamás Váradi, and Bálint Sass: The Hungarian Gigaword Corpus. In Proceedings of LREC 2014. Reykjavik, 2014.