Methods of Corpus Analysis and Corpus Classification

DeReWo – Corpus-Based Lemma and Word Form Lists

In this subproject we are developing methods to create frequency-based ranking lists of lemmata and word forms on the basis of random virtual corpora. By applying these methods to the Mannheim German Reference Corpus DeReKo, we generate different lists of lemmata and word forms of German language usage, for example the lemma candidate list with 350,000 entries for elexiko – the online-dictionary of contemporary German.
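As a rough illustration of what a frequency-based ranking over a random virtual corpus can look like, here is a minimal Python sketch. It is not the DeReWo procedure: the function name `rank_word_forms`, the uniform random sampling of documents, and the whitespace tokenisation are all assumptions made for this example only.

```python
import random
from collections import Counter

def rank_word_forms(documents, sample_size, seed=0):
    """Rank word forms by frequency in a random sample of documents.

    Hypothetical helper: sampling and tokenisation are simplifying assumptions.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = rng.sample(documents, min(sample_size, len(documents)))
    counts = Counter()
    for text in sample:
        counts.update(text.split())  # naive whitespace tokenisation (assumption)
    # Sort by descending frequency, then by word form for a stable order
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, form, freq) for rank, (form, freq) in enumerate(ranked, start=1)]
```

Calling `rank_word_forms(texts, sample_size=1000)` on a list of document strings returns `(rank, word form, frequency)` triples. A lemma list would additionally require mapping each word form to its base form, which this sketch does not attempt.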

In addition to the various corpus-based word form and lemma lists that have been available since 2007, this research area is broadening its scope to include

  • corpus-based character frequency lists,
  • letter transition frequencies relevant to cursive writing, as well as
  • corpus-based collections of typical word combinations.

Corpus-Based Lemma and Word Form Lists

Current Main Subjects

  • spelling classification
  • paradigmatic classification
  • temporal / regional / text typological and similar differentiation
  • exceptions
  • quality management

DeReWo Lemma and Word Form Lists Currently Available for Download

The Institute for the German Language regularly receives queries about the “most common German words”, usually on the assumption that such a request is clear enough and therefore easy to answer. With the publication of the DeReWo lemma and word form lists, we try to find a compromise between the fascinating diversity of our linguistic reality and the justified desire for a description of it that is as compact as possible, even if partially simplified. The general annotations give an overview of the issues that are relevant to creating and using such lists and that we have dealt with. These general annotations are included in each archive in the version that applies to it. You can download the current version directly here. In addition to the general annotations, each DeReWo list comes with detailed product-specific documentation, whose structure follows that of the general annotations. It is designed to help users understand the particular view of the language taken in the list and the resulting simplifications and consequences for interpreting and using it.

Name | Type | Number of Entries | Published on | Download
--- | --- | --- | --- | ---
DeReKo-2014-II-MainArchive-STT.100000 | Word Form + Lemma + POS Frequency List | 100,000 | December 31, 2014 | download (zip format), download (7z format)
derewo-v-ww-bll-320000g-2012-12-31-1.0 | Lemma List | 326,946 | December 31, 2012 | download
derewo-v-ww-bll-250000g-2011-12-31-0.1 | Lemma List | 250,000 | December 31, 2011 | download
derewo-v-40000g-2009-12-31-0.1 | Lemma List | 40,000 | December 31, 2009 | download
derewo-v-100000t-2009-04-30-0.1 | Word Form List | 100,000 | May 12, 2009 | download
derewo-v-30000g-2007-12-31-0.1 | Lemma List | 30,000 | December 31, 2007 | download

  • Using the DeReWo lists without knowing the corresponding documentation is scientifically dubious.
  • Referencing or passing on the DeReWo lists without the corresponding documentation is not allowed.
  • Commercial use of DeReWo lists is prohibited.
  • If you have problems downloading the lists, please proceed as follows:
    • first, download the archive and save it locally
    • then, unpack the archive (usually possible by double-clicking); a new folder will be created
    • start an application (word processor, spreadsheet or the like)
    • load the file (the one without a PDF file extension) from the new folder into the application
    • if required, specify the encoding ISO-8859-15 (if necessary, look it up in the documentation)
    • if this does not lead to the desired results, please send an email to the address listed below
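For readers who prefer to open an unpacked list in a script rather than in a word processor or spreadsheet, the encoding step above can be sketched in Python as follows. The helper name `read_list_lines` and the file path in the usage note are hypothetical, and the sketch makes no assumption about the column layout of a list beyond "one entry per line"; the product-specific documentation remains the authority on both encoding and format.

```python
def read_list_lines(path, encoding="iso-8859-15"):
    """Return the non-empty lines of a text file decoded with the given encoding.

    Hypothetical helper: ISO-8859-15 is the encoding named in the steps above;
    check the documentation of the list you downloaded before relying on it.
    """
    with open(path, encoding=encoding) as handle:
        # Strip only the trailing line break and skip blank lines
        return [line.rstrip("\n") for line in handle if line.strip()]
```

A call such as `read_list_lines("unpacked-folder/list.txt")` (placeholder path) then yields the raw lines, with umlauts, ß and the euro sign decoded correctly if the file really is ISO-8859-15.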

If you have any questions or suggestions, please send an email to derewo(at)ids-mannheim.de.

Corpus-based character frequency lists

DeReChar-v-uni-XXX-2018-02-28-1.0

For various purposes, it is of interest how the frequencies of individual characters (in particular the letters of the German alphabet) are distributed in language use. To this end, we have carried out a series of analyses on our collection of authentic texts, the German Reference Corpus DeReKo; they are summarised in this documentation. The documentation describes the background and characteristics of the various lists "derechar-v-uni-XXX-2018-02-28-1.0" created in the study, which are listed in the overview below as references (and are also available for download on the respective pages).

uniXXX = | all distinctive characters | German alphabet only | German alphabet only
Calculation of relative frequency | | Distinguish upper/lower case | Ignore upper/lower case
with „andere Zeichen“ (other characters) | ...uni-204-a-c... | ...uni-059-a-c... | ...uni-030-a-l...
without „andere Zeichen“ (other characters) | | ...uni-059-b-c... | ...uni-030-b-l...
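The distinctions in this overview (upper/lower case distinguished or ignored, German alphabet only or all characters, with or without „andere Zeichen“, i.e. other characters) can be illustrated with a small Python sketch. It is only a guess at the general idea, not the procedure documented for DeReChar: the function name, the `"<other>"` bucket, and the 30-letter set (a to z plus ä, ö, ü, ß) are assumptions introduced here.

```python
from collections import Counter

# Assumed lower-case German alphabet for this sketch: a-z plus ä, ö, ü, ß
GERMAN_LOWER = set("abcdefghijklmnopqrstuvwxyzäöüß")

def relative_char_frequencies(text, ignore_case=False, alphabet=None, keep_other=False):
    """Relative character frequencies under the chosen counting options.

    alphabet=None counts every distinct character. With an alphabet, characters
    outside it are either dropped or pooled in "<other>" (keep_other=True).
    """
    if ignore_case:
        text = text.lower()
    counts = Counter()
    for ch in text:
        if alphabet is None or ch in alphabet:
            counts[ch] += 1
        elif keep_other:
            counts["<other>"] += 1  # hypothetical stand-in for "andere Zeichen"
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()} if total else {}
```

The point of the sketch is that the same text yields different relative frequencies depending on whether the pooled other characters are part of the denominator, which is why the choice of list matters for interpretation.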

Letter transition frequencies relevant to cursive writing

DeReChar-v-[bi|uni]-[KJL|DRC]-2021-10-31-1.0

Creating frequency lists for bigrams (here: two-character sequences) with a scope and setup as comprehensive as those of the character frequency lists above is much more complex and yields few results that are manageable and offer real insight.

From a small study intended to help evaluate different approaches to teaching joined-up handwriting, we offer bigram frequency data here (along with the corresponding unigram frequency data). Using this information only makes sense against the background of that study and with knowledge of the documentation mentioned below.

The evaluation is based on a restricted definition of tokens and bigrams and focuses on the DeReKo corpus of children's and youth literature (KJL), a dataset that approximates the target group of the research question as closely as possible. In addition, the same evaluations were carried out on the current version of the dataset underlying the above-mentioned DeReChar study (DRC). Because of the restricted token and bigram definition, however, those results appear to be useful only for comparison with the bigram representations.
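To make the notion of letter transitions concrete, the following Python sketch counts two-character sequences inside tokens only, so that no bigram spans a word boundary. The restriction actually applied in the study is defined in its documentation and is not reproduced here; whitespace tokenisation and the function name `bigram_counts` are assumptions for illustration.

```python
from collections import Counter

def bigram_counts(text):
    """Count two-character sequences within whitespace-separated tokens.

    Simplifying assumption: tokens are split on whitespace, and bigrams never
    cross a token boundary (a pen lift between words, in handwriting terms).
    """
    counts = Counter()
    for token in text.split():
        # Pair each character with its successor inside the same token
        for first, second in zip(token, token[1:]):
            counts[first + second] += 1
    return counts
```

For the input `"Anna an"`, the sketch counts `An`, `nn`, `na` and `an` once each; upper and lower case are kept apart, and one-character tokens contribute nothing.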

The results are presented in three ways for both datasets: (1) synoptically, focused on the essential statements, in particular the bigrams grouped by given categories; (2) visually, showing only the bigrams arranged by predefined categories; and (3) as quantitative overviews of all bigram and unigram frequencies.

For details on the subdivisions of the representations, the frequency information, the file types and the further (processing) options, please read the detailed documentation; please also note the licence notice contained therein.

DeReChar-v-XXX-YYY-2021-10-31-1.0 download:

Representation | XXX = | YYY = KJL | YYY = DRC
--- | --- | --- | ---
Synopsis | bi | …v-bi-KJL…txt | …v-bi-DRC…txt
Visualisation | bi | …v-bi-KJL…html | …v-bi-DRC…html
Total | bi | …v-bi-KJL…csv | …v-bi-DRC…csv
Total | uni | …v-uni-KJL…csv | …v-uni-DRC…csv

For handling the csv files, please refer to the notes in the documentation. What happens when you click a link depends on the local settings of your browser and spreadsheet programme. In case of problems, first save the file locally, then start the application and import the file into the running application (by explicit import if available, otherwise via Open) using the options given in the documentation.

Corpus-based collections of typical word combinations

In addition to the rather cross-sectional, comprehensive collection of typical word combinations in general language use offered by the co-occurrence database CCDB, this field of work is concerned with how subsets of typical word combinations can be compiled for particular segments of the language or from particular perspectives. A first test version, covering the lemmas of the valency dictionary and offering various options for selecting typical word combinations of differing quality, has been published under the name DeReKoll - Collocation Treasures for the German Reference Corpus. Further variants are in preparation.

Questions

If you have any questions or suggestions, please send an email to derewo(at)ids-mannheim.de.

Collaborations