Corpora of written language
Availability
The bulk of DeReKo can be searched and analysed free of charge for non-commercial purposes using COSMAS II. Due to copyright regulations and contractual agreements with rights holders, however, only a few sub-corpora can be offered for download. For more information, see the FAQ "Are there terms and conditions that allow for exceptions?"
By License Agreement
If you sign a license agreement, IDS is permitted to provide free access for scientific use to the following corpora of written language:
If you are interested in these corpora, please send an email to Ms Petra Brecht with a brief description of why you need the data in TEI-XML format and why searches in DeReKo with COSMAS II or KorAP are not sufficient for your purposes.
Download Server
In addition, the following corpora are available for download, each under a CC-BY-SA license:
- Corpus of speeches and interviews (rei)
- Wikipedia Corpora:
Conversion 2011 in collaboration with the EuroGr@mm project [1].
Conversions 2013 and 2015 in collaboration with the Programme Area Forschungsinfrastrukturen [2].
Conversion 2017 by the Programme Area Corpus Linguistics.
Year | WP subcorpus | I5 | WikiXML | TreeTagger Standoff |
2011 | articles | wpd11.xces.bz2 | -/- | -/- |
2011 | article talk | wdd11.xces.bz2 | -/- | -/- |
2013 | articles | wpd13.i5.xml.bz2 | dewikixml-20130728-articles.tar.gz | wpd13.tt.xml.bz2 |
2013 | article talk | wdd13.i5.xml.bz2 | dewikixml-20130728-discussions.tar.gz | wdd13.tt.xml.bz2 |
2013 | articles sample | wpd13_sample.i5.xml.bz2 | -/- | -/- |
2013 | article talk sample | wdd13_sample.i5.xml.bz2 | -/- | -/- |
2015 | articles | wpd15.i5.xml.bz2 | wpd15.wikixml.tar.gz | wpd15.tt.xml.bz2 |
2015 | article talk | wdd15.i5.xml.bz2 | wdd15.wikixml.tar.gz | wdd15.tt.xml.bz2 |
2015 | user talk | wud15.i5.xml.bz2 | wud15.wikixml.tar.gz | wud15.tt.xml.bz2 |
2015 | articles sample | wpd15_sample.i5.xml.bz2 | -/- | -/- |
2015 | article talk sample | wdd15_sample.i5.xml.bz2 | -/- | -/- |
2015 | user talk sample | wud15_sample.i5.xml.bz2 | -/- | -/- |
2017 | articles | wpd17.i5.xml.bz2 | | |
2017 | article talk | wdd17.i5.xml.bz2 | | |
2017 | user talk | wud17.i5.xml.bz2 | | |
2017 | redundancy talk | wrd17.i5.xml.bz2 | | |
Language | articles | article talk |
French | frwiki-20130904-articles.i5.bz2 | frwiki-20130904-discussions.i5.bz2 |
Hungarian | huwiki-20140503-articles.i5.bz2 | huwiki-20140503-discussions.i5.bz2 |
Norwegian | nowiki-20140512-articles.i5.bz2 | nowiki-20140512-discussions.i5.bz2 |
Italian | itwiki-20130508-articles.i5.bz2 | itwiki-20130508-discussions.i5.bz2 |
Polish | plwiki-20140503-articles.i5.bz2 | plwiki-20140503-discussions.i5.bz2 |
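The I5 downloads listed above are bzip2-compressed XML files, so they can be read as a stream without unpacking them to disk first. The following is a minimal sketch in Python, not an official IDS tool: the function name `count_elements` is hypothetical, and it assumes the file parses as plain XML with the standard library parser (a file that relies on entities defined in an external DTD would need additional handling). It makes no assumption about which element names occur in I5 and simply counts whatever it finds.

```python
import bz2
import xml.etree.ElementTree as ET
from collections import Counter


def count_elements(path):
    """Stream a bz2-compressed XML file and count how often each element name occurs.

    Hypothetical helper for illustration; element names are reported as found,
    including any namespace prefix in {braces}.
    """
    counts = Counter()
    # bz2.open decompresses on the fly, so the archive is never fully unpacked.
    with bz2.open(path, "rb") as stream:
        for _event, elem in ET.iterparse(stream, events=("end",)):
            counts[elem.tag] += 1
            # Drop the finished element's text and children to limit memory use.
            elem.clear()
    return counts
```

For example, `count_elements("wpd17.i5.xml.bz2").most_common(10)` would list the ten most frequent element names in the 2017 articles corpus, which is a quick way to get an overview of the markup before writing a more specific extraction script.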
References
[1] Noah Bubenhofer, Stefanie Haupt, Horst Schwinn (2011): A Comparable Corpus of the Wikipedia: From Wiki Syntax to POS Tagged XML.
[2] Eliza Margaretha, Harald Lüngen (2014): Building linguistic corpora from Wikipedia articles and discussions. In: Journal for Language Technology and Computational Linguistics (JLCL) 2/2014.
Tools
- IDS Wikipedia Converter Java jars
- Eliza Margaretha: Documentation of the IDS Wikipedia Converter.
Contact:
<korpuslinguistik@ids-...>
Head:
Dr. Marc Kupietz <kupietz@ids-...>
Research staff:
Dr. Harald Lüngen <luengen@ids-...>
Rainer Perkuhn <perkuhn@ids-...>
Student assistants:
- Caroline Iliadi
- Ines Pisetta