Corpora of Written Language

Current Corpus Archive (as of 1/2023)

Size and Extent

The IDS began building electronic text corpora in the mid-1960s. The corpora have grown from about 28 million text words in 1992 to 57 billion text words in 2024 (equivalent to about 142 million book pages, assuming an average of 400 words per page). Many staff members have contributed to creating the largest collection of its kind worldwide. The corpus archive is continually being extended, and existing corpus material is revised as part of ongoing quality management. The results of this work are published regularly through the Corpus Query System project (see Release-Chronicle).
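The page-count comparison can be checked with a few lines of arithmetic. The following is a minimal sketch using only the figures quoted in the paragraph above. The function name `words_to_pages` and the variable names are illustrative and not part of any IDS tool.

```python
def words_to_pages(total_words: int, words_per_page: int = 400) -> float:
    """Convert a word count into an equivalent number of book pages."""
    return total_words / words_per_page


# Figures taken from the text; the names are hypothetical.
corpus_words_1992 = 28_000_000        # about 28 million text words
corpus_words_2024 = 57_000_000_000    # 57 billion text words

pages = words_to_pages(corpus_words_2024)
growth_factor = corpus_words_2024 / corpus_words_1992

print(f"{pages / 1e6:.1f} million pages")      # 142.5 million pages
print(f"growth factor: {growth_factor:.0f}x")  # growth factor: 2036x
```

Under the 400-words-per-page assumption the result is 142.5 million pages, which matches the rounded figure of about 142 million. The same numbers imply that the archive grew roughly two-thousand-fold between 1992 and 2024.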

Geographic Origin of the DeReKo Newspaper Sources

Siglum Name From To License Tokens

Archived Corpora

Unfortunately, a small part of the archived corpora is not accessible from outside the IDS for copyright and licensing reasons. In recent years, this share has been reduced to under 5%. In general, the IDS corpora may be used for scientific, non-commercial purposes only. For more details on the options for using the IDS corpora, see: Information regarding the availability