Corpora of written language

Availability

The bulk of DeReKo can be searched and analysed free of charge with COSMAS II for non-commercial purposes. Because of copyright regulations and contractual agreements with rights holders, however, only a few sub-corpora can be offered for download. For more information, see the FAQ entry "Are there terms and conditions that allow for exceptions?"

By license agreement

If you sign a license agreement, IDS is permitted to provide free access for scientific use to the following corpora of written language:

If you are interested in these corpora, please send an email to Ms Petra Brecht with a brief description of why you need the data in TEI-XML format and why searches in DeReKo with COSMAS II or KorAP are not sufficient for your purposes.

Download Server

In addition, the following corpora are available for download, each under a CC BY-SA license:

  • Corpus of speeches and interviews (rei)
  • Wikipedia Corpora:
    Conversion 2011 in collaboration with the EuroGr@mm project [1],
    Conversions 2013 and 2015 in collaboration with the programme area Forschungsinfrastrukturen [2],
    Conversion 2017 by the programme area Corpus Linguistics.

German-language Wikipedia: available files 2011-2017 (encoding ISO-8859-1)

Year  WP subcorpus         I5                       WikiXML                                TreeTagger standoff
2011  articles             wpd11.xces.bz2           -/-                                    -/-
      article talk         wdd11.xces.bz2
2013  articles             wpd13.i5.xml.bz2         dewikixml-20130728-articles.tar.gz     wpd13.tt.xml.bz2
      article talk         wdd13.i5.xml.bz2         dewikixml-20130728-discussions.tar.gz  wdd13.tt.xml.bz2
      articles sample      wpd13_sample.i5.xml.bz2  -/-                                    -/-
      article talk sample  wdd13_sample.i5.xml.bz2
2015  articles             wpd15.i5.xml.bz2         wpd15.wikixml.tar.gz                   wpd15.tt.xml.bz2
      article talk         wdd15.i5.xml.bz2         wdd15.wikixml.tar.gz                   wdd15.tt.xml.bz2
      user talk            wud15.i5.xml.bz2         wud15.wikixml.tar.gz                   wud15.tt.xml.bz2
      articles sample      wpd15_sample.i5.xml.bz2  -/-                                    -/-
      article talk sample  wdd15_sample.i5.xml.bz2
      user talk sample     wud15_sample.i5.xml.bz2
2017  articles             wpd17.i5.xml.bz2
      article talk         wdd17.i5.xml.bz2
      user talk            wud17.i5.xml.bz2
      redundancy talk      wrd17.i5.xml.bz2
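
The I5 files listed for the German-language Wikipedia are bz2-compressed XML, so they can be inspected without unpacking them to disk first. The following Python sketch streams such a file and counts how often each element name occurs. It rests on several assumptions: the file has already been downloaded, its XML declaration names the encoding (ISO-8859-1 for the German files), and it parses without access to an external DTD. The names count_elements and limit are illustrative and not part of any IDS tool.

```python
import bz2
import xml.etree.ElementTree as ET
from collections import Counter


def count_elements(path, limit=None):
    """Stream a bz2-compressed XML file and count how often each tag occurs.

    path  -- local path to a downloaded file (assumed to exist)
    limit -- optional maximum number of closed elements to read
    """
    counts = Counter()
    # Binary mode: the XML parser decodes the bytes itself, using the
    # encoding named in the file's XML declaration (an assumption here).
    with bz2.open(path, "rb") as stream:
        events = ET.iterparse(stream, events=("end",))
        for seen, (_event, elem) in enumerate(events, start=1):
            counts[elem.tag] += 1
            # Drop the element's text and children once it has been counted,
            # so that large corpora do not accumulate in memory.
            elem.clear()
            if limit is not None and seen >= limit:
                break
    return counts
```

A call such as count_elements("wpd13_sample.i5.xml.bz2", limit=100000) would give a first impression of the markup used in a sample file. If a file relies on entities defined in an external DTD, this parser setup will raise an error and a DTD-aware parser would be needed instead.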


Other-language Wikipedias 2013: available files (format I5, encoding UTF-8)

Language   articles                          article talk
French     frwiki-20130904-articles.i5.bz2   frwiki-20130904-discussions.i5.bz2
Hungarian  huwiki-20140503-articles.i5.bz2   huwiki-20140503-discussions.i5.bz2
Norwegian  nowiki-20140512-articles.i5.bz2   nowiki-20140512-discussions.i5.bz2
Italian    itwiki-20130508-articles.i5.bz2   itwiki-20130508-discussions.i5.bz2
Polish     plwiki-20140503-articles.i5.bz2   plwiki-20140503-discussions.i5.bz2


Other-language Wikipedias 2015: available files (format I5, encoding UTF-8)

Language   articles                                 article talk                           user talk
English    enwiki-20150808-article.i5.utf8.xml.bz2  enwiki-20150808-talk.i5.utf8.xml.bz2   enwiki-20150808-user-talk.i5.utf8.xml.bz2
French     frwiki-20150808-article.i5.utf8.xml.bz2  frwiki-20150808-talk.i5.utf8.xml.bz2   frwiki-20150808-user-talk.i5.utf8.xml.bz2
Hungarian  huwiki-20150807-article.i5.utf8.xml.bz2  huwiki-20150807-talk.i5.utf8.xml.bz2   huwiki-20150807-user-talk.i5.utf8.xml.bz2
Norwegian  nowiki-20150807-article.i5.utf8.xml.bz2  nowiki-20150807-talk.i5.utf8.xml.bz2   nowiki-20150807-user-talk.i5.utf8.xml.bz2
Spanish    eswiki-20150808-article.i5.utf8.xml.bz2  eswiki-20150808-talk.i5.utf8.xml.bz2   eswiki-20150808-user-talk.i5.utf8.xml.bz2
Croatian   hrwiki-20150807-article.i5.utf8.xml.bz2  hrwiki-20150807-talk.i5.utf8.xml.bz2   hrwiki-20150807-user-talk.i5.utf8.xml.bz2
Italian    itwiki-20150808-article.i5.utf8.xml.bz2  itwiki-20150808-talk.i5.utf8.xml.bz2   itwiki-20150808-user-talk.i5.utf8.xml.bz2
Polish     plwiki-20150808-article.i5.utf8.xml.bz2  plwiki-20150808-talk.i5.utf8.xml.bz2   plwiki-20150808-user-talk.i5.utf8.xml.bz2
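
The file names of the other-language Wikipedias appear to encode the language code, the date stamp of the dump and the subcorpus. When handling many downloads, it can help to recover these parts programmatically. The Python sketch below does so with a regular expression that is inferred solely from the names listed here; it is not an official naming specification, and parse_wiki_filename and WikiFile are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class WikiFile(NamedTuple):
    language: str   # Wikipedia language code, e.g. "fr"
    dump_date: str  # date stamp taken from the file name, YYYYMMDD
    part: str       # subcorpus label exactly as spelled in the file name


# Assumption: pattern derived only from the file names in the lists above.
# It covers both "...-articles.i5.bz2" and "...-article.i5.utf8.xml.bz2".
_PATTERN = re.compile(
    r"^(?P<language>[a-z]+)wiki-(?P<dump_date>\d{8})-(?P<part>[a-z-]+)"
    r"\.i5(?:\.utf8\.xml)?\.bz2$"
)


def parse_wiki_filename(name: str) -> Optional[WikiFile]:
    """Return the parts of a file name, or None if it does not fit the pattern."""
    match = _PATTERN.match(name)
    if match is None:
        return None
    return WikiFile(**match.groupdict())
```

The German-language files (for example wpd15.i5.xml.bz2) follow a different naming scheme, so the function returns None for them.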

References

[1] Noah Bubenhofer, Stefanie Haupt, Horst Schwinn (2011): A Comparable Corpus of the Wikipedia: From Wiki Syntax to POS Tagged XML. Hamburg Working Paper in Multilingualism, 96 B
[2] Eliza Margaretha, Harald Lüngen (2014): Building linguistic corpora from Wikipedia articles and discussions. In: Journal for Language Technology and Computational Linguistics (JLCL) 2/2014


Contact:
    <korpuslinguistik@ids-...>

Head:
    Dr. Marc Kupietz <kupietz@ids-...>

Research staff:
    Cyril Belica <belica@ids-...>
    Dr. Harald Lüngen <luengen@ids-...>
    Rainer Perkuhn <perkuhn@ids-...>
 
Student assistants:

  • Caroline Iliadi
  • Ines Pisetta