Corpora of Written Language

Current Corpus Archive

Size and Extent

The IDS began building electronic text corpora in the mid-1960s. The corpora grew from about 28 million text words in 1992 to 28 billion text words in 2015 (equivalent to about 70 million book pages, assuming an average of 400 words per page). Many staff members have contributed to creating the largest collection of its kind worldwide. The corpus archive is continually extended, and existing corpus material is revised as part of ongoing quality management. The results of this work are published regularly through the COSMAS II project (see Release-Chronicle).
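The page-equivalent figure and the growth between 1992 and 2015 follow from simple arithmetic on the numbers quoted above. A minimal sketch of that calculation is shown here; the function name `page_equivalent` is a hypothetical helper introduced only for illustration, and the 400 words per page is the assumption stated in the text.

```python
def page_equivalent(words: int, words_per_page: int = 400) -> float:
    """Convert a word count to an approximate number of book pages.

    The default of 400 words per page is the average assumed in the text.
    """
    return words / words_per_page


# Corpus sizes in text words, as quoted in the text.
words_1992 = 28_000_000
words_2015 = 28_000_000_000

pages_2015 = page_equivalent(words_2015)
growth_factor = words_2015 / words_1992

print(f"{pages_2015:,.0f} pages")        # 70,000,000 pages
print(f"{growth_factor:,.0f}x growth")   # 1,000x growth
```

Under these figures, the archive grew by a factor of about one thousand between 1992 and 2015.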

Geographic Origin of the DeReKo Newspaper Sources

Archived Corpora

Unfortunately, a small part of the archived corpora is not accessible from outside the IDS for copyright and licensing reasons. In recent years, this part has been reduced to under 5%. In general, the IDS corpora may be used for scientific, non-commercial purposes only. For more details about the options available for using the IDS corpora, see: Information regarding the availability
Sigle Name from to License Tokens

    Dr. Marc Kupietz <kupietz@ids-...>
Research staff:
    Cyril Belica <belica@ids-...>
    Dr. Harald Lüngen <luengen@ids-...>
    Rainer Perkuhn <perkuhn@ids-...>
    see here
Former IDS staff involved in corpus construction:
    see here
Student assistants:

  • Caroline Iliadi
  • Ines Pisetta