Prepared by Vladimír Benko within the framework of a joint project of
Main design decisions
- Slovak-centric (languages spoken and/or taught in Slovakia and its neighbouring countries)
- Latin names denoting language and size
- Crawled by SpiderLing at (approximately) the same time
- Language-independent filtration by the same tools
- Language-dependent filtration by the same methodology
- Tagged by open-source or free tools, with native tagsets mapped to the Araneum Universal Tagset
- Deduplicated at document level: duplicate and near-duplicate documents deleted
- Deduplicated at paragraph and/or sentence level: duplicate and near-duplicate segments marked
- Word sketches with compatible sketch grammars
- Accessible online via web interface (NoSketch Engine) at unesco.uniba.sk or
(no registration required in Guest mode)
- Also hosted under KonText at kontext.korpus.cz (free registration required), and
under Sketch Engine at www.sketchengine.co.uk (paid access, 30-day free trial available)
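
The two deduplication levels in the list above treat a repeated item differently: a repeated document is removed, while a repeated paragraph or sentence is kept and only flagged. The following is a minimal Python sketch of that distinction, not the tooling actually used to build the corpora. It assumes that two texts count as duplicates when they are identical after case-folding and whitespace normalisation, so it does not catch near-duplicates; the names `fingerprint` and `deduplicate` are hypothetical.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hash of the text after case-folding and whitespace normalisation."""
    normalised = " ".join(text.lower().split())
    return hashlib.sha1(normalised.encode("utf-8")).hexdigest()


def deduplicate(documents):
    """Deduplicate a list of documents, each given as a list of paragraphs.

    Repeated documents are dropped. Repeated paragraphs are kept but marked:
    every kept document becomes a list of (paragraph, is_duplicate) pairs.
    Assumption: only exact matches after normalisation are detected.
    """
    seen_documents = set()
    seen_paragraphs = set()
    kept = []
    for paragraphs in documents:
        document_key = fingerprint("\n".join(paragraphs))
        if document_key in seen_documents:
            continue  # document level: delete the repeated document
        seen_documents.add(document_key)

        marked = []
        for paragraph in paragraphs:
            paragraph_key = fingerprint(paragraph)
            # segment level: keep the paragraph, record whether it was seen before
            marked.append((paragraph, paragraph_key in seen_paragraphs))
            seen_paragraphs.add(paragraph_key)
        kept.append(marked)
    return kept
```

Marking rather than deleting segments leaves the surrounding document intact, so a corpus user can decide at query time whether to include the flagged material.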
Aranea Corpora available (May 2016)
If you use the Aranea corpora for research purposes, or need to mention them for any reason,
please cite the following paper(s):
- Benko, Vladimír: Aranea: Yet Another Family of (Comparable) Web Corpora.
In Petr Sojka, Aleš Horák, Ivan Kopeček and Karel Pala (Eds.):
Text, Speech and Dialogue. 17th International Conference,
TSD 2014, Brno, Czech Republic, September 8–12, 2014. Proceedings.
Springer International Publishing Switzerland, 2014. pp. 257–264.
ISBN 978-3-319-10815-5 (Print), 978-3-319-10816-2 (Online).
- Benko, Vladimír: Compatible Sketch Grammars for Comparable Corpora.
In Andrea Abel, Chiara Vettori, Natascia Ralli (Eds.):
Proceedings of the XVI EURALEX International Congress: The User In Focus. 15–19 July 2014.
Bolzano/Bozen: Eurac Research, 2014. pp. 417–430. ISBN 978-88-88906-97-3.
As well as the paper on the NoSketch Engine:
- Rychlý, Pavel: Manatee/Bonito – A Modular Corpus Manager.
In 1st Workshop on Recent Advances in Slavonic Natural Language Processing.
Brno: Masaryk University, 2007, pp. 65–70. ISBN 978-80-210-4471-5.
If you need the source corpus data, please send a message to vladimir.benko