Development and Maintenance of Contemporary Written Corpora

The world's largest collection of German-language corpora as an empirical basis for linguistic research

The Mannheim German Reference Corpus (DeReKo)

The Corpora of Contemporary Written German at the IDS

  • constitute the world's largest linguistically motivated collection of electronic corpora of written German texts from the present and the recent past (over 57.6 billion words as of January 2024)
  • can be accessed via COSMAS II and KorAP free of charge
  • contain fiction, scientific and popular science texts, a large number of newspaper texts, and a wide range of other text types. They are continuously being developed further
  • are acquired with the aim of maximizing size and diversity, so that users can define virtual corpora in COSMAS II and KorAP. These can be either representative corpora or corpora tailored to a particular research question
Status 03/2022

Current DeReKo Extensions

  • DeReKo-2024-I brings the following new sources:
    • KEM Konstruktion|Automation
    • Cicero
    • Manager-Magazin
  • Release DeReKo-2022-I added the following new sources:
    • NottDeuYTsch: YouTube comment corpus by Louis Cotgrove (ndy), accessible only internally at the IDS
    • Twitter sample corpus (twi21), accessible only internally at the IDS
    • Gingko: Geschriebenes Ingenieurwissenschaftliches Korpus (written corpus of engineering sciences) of Leipzig University
      • Automobiltechnische Zeitschrift, volumes 2007-2016 (atz)
      • Motortechnische Zeitschrift, volumes 2007-2016 (mtz)

Planned Extensions

  • Formula fiction (genre literature)
  • Project reports
  • First transcripts from FOLK/DGD
  • MoCoDa2 corpus
  • Scientific and professional literature
    • ATZ and MTZ volumes from 2017 onward

Recent publications on DeReKo

  • Kupietz, Marc/Lüngen, Harald/Kamocki, Paweł/Witt, Andreas (2018): The German Reference Corpus DeReKo: New Developments – New Opportunities. In: Calzolari, Nicoletta/Choukri, Khalid/Cieri, Christopher/Declerck, Thierry/Goggi, Sara/Hasida, Koiti/Isahara, Hitoshi/Maegaard, Bente/Mariani, Joseph/Mazo, Hélène/Moreno, Asuncion/Odijk, Jan/Piperidis, Stelios/Tokunaga, Takenobu (eds.): Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). Miyazaki: European Language Resources Association (ELRA), 4353-4360.
  • Kupietz, Marc/Lüngen, Harald (2014): Recent Developments in DeReKo. In: Calzolari, Nicoletta et al. (eds.): Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14). Reykjavik: ELRA, 2378-2385. http://www.lrec-conf.org/proceedings/lrec2014/pdf/842_Paper.pdf
  • Kupietz, Marc/Belica, Cyril/Keibel, Holger/Witt, Andreas (2010): The German Reference Corpus DeReKo: A primordial sample for linguistic research. In: Calzolari, Nicoletta et al. (eds.): Proceedings of the 7th conference on International Language Resources and Evaluation (LREC 2010). Valletta, Malta: European Language Resources Association (ELRA), 1848-1854. http://www.lrec-conf.org/proceedings/lrec2010/pdf/414_Paper.pdf
  • Kupietz, Marc/Keibel, Holger (2009): The Mannheim German Reference Corpus (DeReKo) as a basis for empirical linguistic research. In: Minegishi, Makoto/Kawaguchi, Yuji (eds.): Working Papers in Corpus-based Linguistics and Language Education, No. 3. Tokyo: Tokyo University of Foreign Studies (TUFS), 53-59. http://cblle.tufs.ac.jp/assets/files/publications/working_papers_03/section/053-059.pdf

Contact:
    <korpuslinguistik(at)ids-...>

Head:
    Dr. Harald Lüngen <luengen(at)ids-...>

Research staff:

    Dr. Marc Kupietz <kupietz(at)ids-...>
    Rainer Perkuhn <perkuhn(at)ids-...>

Cooperations:
    see here

Former IDS staff involved in building the corpora:
    see here

Student assistants:

  • Nicolas Arnold