DeReWo – Corpus-Based Lemma and Word Form Lists
In this subproject we are developing methods for creating frequency-based ranking lists of lemmata and word forms on the basis of random virtual corpora. By applying these methods to the Mannheim German Reference Corpus DeReKo, we generate various lists of lemmata and word forms of German language usage, for example the lemma candidate list with 350,000 entries for elexiko, the online dictionary of contemporary German.
In addition to the various corpus-based word form and lemma lists that have been available since 2007, the research focus is being extended to include
- corpus-based character frequency lists,
- letter transition frequencies relevant to cursive writing, as well as
- corpus-based collections of typical word combinations.
Current Main Subjects
- spelling classification
- paradigmatic classification
- temporal / regional / text typological and similar differentiation
- exceptions
- quality management
DeReWo Lemma and Word Form Lists Currently Available for Download
The Institute for the German Language regularly receives queries about the “most common German words”, which assume that such a request is clear enough and therefore easy to answer. With the publication of the DeReWo lemma and word form lists we try to find a compromise between the fascinating diversity of our linguistic reality and the justified desire for a description that is as compact as possible, even if partially simplifying. The general annotations are intended to give you an overview of the issues that are relevant to the creation and use of such lists and that we have dealt with. These general annotations are attached to the archives in their respective version. You can download the current version directly here. In addition to the general annotations, detailed product-specific documentation is attached to each DeReWo list. Its structure follows that of the general annotations. It is designed to help you understand the respective view of the language and the resulting simplifications and consequences for interpreting and using the list.
Name | Type | Number of Entries | Published on |
---|---|---|---|
DeReKo-2014-II-MainArchive-STT.100000 | Word Form + Lemma + POS Frequency List | 100,000 | December 31, 2014 |
derewo-v-ww-bll-320000g-2012-12-31-1.0 | Lemma List | 326,946 | December 31, 2012 |
derewo-v-ww-bll-250000g-2011-12-31-0.1 | Lemma List | 250,000 | December 31, 2011 |
derewo-v-40000g-2009-12-31-0.1 | Lemma List | 40,000 | December 31, 2009 |
derewo-v-100000t-2009-04-30-0.1 | Word Form List | 100,000 | May 12, 2009 |
derewo-v-30000g-2007-12-31-0.1 | Lemma List | 30,000 | December 31, 2007 |
- Using the DeReWo lists without knowing the corresponding documentation is scientifically dubious.
- Referencing or passing on the DeReWo lists without the corresponding documentation is not allowed.
- Commercial use of DeReWo lists is prohibited.
- If you have problems downloading the lists, please proceed as follows:
  - first, download the archive and save it locally
  - then, unpack the archive (usually possible by double-clicking); a new folder will be created
  - start an application (word processor, spreadsheet or the like)
  - load the file (the one without a PDF file extension) from the new folder into the application
  - if required, specify the character encoding ISO-8859-15 (if necessary, look it up in the documentation)
  - if this does not lead to the desired result, please send an email to the address listed below
If you have any questions or suggestions, please send an email to derewo(at)ids-mannheim.de.
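The loading step described above can also be scripted. The following is a minimal Python sketch, assuming that a list is a plain text file with whitespace-separated columns and optional comment lines starting with `#`. Both are assumptions made for illustration; the actual layout is described in the documentation of each list. The function names and the sample data are hypothetical. Only the encoding ISO-8859-15 is taken from the instructions above.

```python
from pathlib import Path


def parse_list_lines(lines, comment_prefix="#"):
    """Split each non-empty, non-comment line into whitespace-separated fields.

    The comment prefix and the column layout are assumptions for this sketch.
    """
    rows = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(comment_prefix):
            continue
        rows.append(line.split())
    return rows


def load_list(path, encoding="iso-8859-15"):
    """Read an unpacked list file from disk using the stated encoding."""
    text = Path(path).read_text(encoding=encoding)
    return parse_list_lines(text.splitlines())


# Demonstration with in-memory bytes instead of a real download.
# The two entries and their numbers are invented sample data.
sample_bytes = "# hypothetical header\nStraße 12\nüber 9\n".encode("iso-8859-15")
rows = parse_list_lines(sample_bytes.decode("iso-8859-15").splitlines())
print(rows)
```

Decoding with the wrong encoding is the most likely cause of garbled umlauts and ß, which is why the encoding is passed explicitly rather than left to the platform default.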
Corpus-based character frequency lists
DeReChar-v-uni-XXX-2018-02-28-1.0
For various purposes it is of interest how the frequencies of individual characters (especially, for example, the letters of the German alphabet) are distributed in language use. To this end, we have also carried out a series of analyses on our collection of authentic texts, the German Reference Corpus DeReKo, which are summarised in this documentation. The documentation describes the background and characteristics of the various lists "derechar-v-uni-XXX-2018-02-28-1.0" created in the study; they are listed in the overview below for reference and can be downloaded from the respective pages.
uniXXX= | All distinctive characters, upper/lower case distinguished | German alphabet only, upper/lower case distinguished | German alphabet only, upper/lower case ignored |
---|---|---|---|
Relative frequency calculated with „andere Zeichen“ (other characters) | ...uni-204-a-c... | ...uni-059-a-c... | ...uni-030-a-l... |
Relative frequency calculated without „andere Zeichen“ | | ...uni-059-b-c... | ...uni-030-b-l... |
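To illustrate how the variants in the overview differ, here is a minimal Python sketch of a relative character frequency count. It is not the procedure used in the study. It assumes that whitespace is not counted at all and that every character outside the German alphabet is pooled into a single „andere Zeichen“ bucket; both choices, as well as the function and constant names, are assumptions for illustration. The alphabet sizes of 59 (case distinguished) and 30 (case ignored) happen to match the numbers in the list names, which is our reading rather than a documented fact.

```python
from collections import Counter

GERMAN_LOWER = set("abcdefghijklmnopqrstuvwxyzäöüß")  # 30 characters
GERMAN_UPPER = set("ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÜ")   # 29 characters
OTHER = "andere Zeichen"


def char_frequencies(text, ignore_case=False, include_other=True):
    """Return relative frequencies of German letters in `text`.

    ignore_case=True folds upper case into lower case.
    include_other=True counts all remaining characters in one bucket,
    which then also enters the denominator.
    """
    alphabet = GERMAN_LOWER if ignore_case else GERMAN_LOWER | GERMAN_UPPER
    counts = Counter()
    for ch in text:
        if ch.isspace():
            continue  # assumption: whitespace is ignored entirely
        if ignore_case:
            ch = ch.lower()
        if ch in alphabet:
            counts[ch] += 1
        elif include_other:
            counts[OTHER] += 1
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()} if total else {}


with_other = char_frequencies("Aa b!", ignore_case=True, include_other=True)
without_other = char_frequencies("Aa b!", ignore_case=True, include_other=False)
print(with_other)
print(without_other)
```

Including or excluding the „andere Zeichen“ bucket changes the denominator, so the relative frequency of the same letter differs between the two kinds of list.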
Letter transition frequencies relevant to cursive writing
DeReChar-v-[bi|uni]-[KJL|DRC]-2021-10-31-1.0
Creating bigram frequency lists (here: bigrams in the sense of two-character sequences) with a similarly comprehensive scope and setting as the character frequency lists above is much more complex and yields only a few manageable and insightful results.
From a small study intended to help evaluate different approaches to teaching joined (cursive) handwriting, we offer bigram frequency data (and the corresponding unigram frequency data) here. Using this information only makes sense against the background of that study and with knowledge of the documentation mentioned below.
The evaluation is based on a restricted definition of tokens and bigrams and focuses on KJL, the DeReKo corpus of children's and youth literature, a dataset that approximates the target group of the research question as closely as possible. In addition, the same evaluations were carried out on the current version of the dataset underlying the above-mentioned DeReChar study (DRC). Because of the restricted definition of tokens and bigrams, however, those results appear to be useful only for comparison with the bigram representations.
The results are presented in three ways for both datasets: (1) as a synopsis focused on the essential findings, in particular on the bigrams grouped by given categories; (2) as visualisations, covering only the bigram arrangements by predefined categories; and (3) as quantitative overviews of all bigram and unigram frequencies.
For details on the various subdivisions of the representations, the frequency information, the file types and the options for further processing, please read the detailed documentation; please also note the licence notice it contains.
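As a rough illustration of what counting letter transitions within tokens can look like, here is a minimal Python sketch. The restricted definition of tokens and bigrams actually used in the study is given in its documentation; this sketch simply assumes that a token is a maximal run of German letters and that a bigram is a pair of adjacent letters inside one token, so no bigram spans a token boundary. Case is kept as is. The names `TOKEN_RE`, `bigram_counts` and `unigram_counts` are hypothetical.

```python
import re
from collections import Counter

# Assumption: a token is a maximal run of German letters.
TOKEN_RE = re.compile(r"[A-Za-zÄÖÜäöüß]+")


def bigram_counts(text):
    """Count adjacent letter pairs inside each token."""
    counts = Counter()
    for token in TOKEN_RE.findall(text):
        counts.update(a + b for a, b in zip(token, token[1:]))
    return counts


def unigram_counts(text):
    """Count the single letters of all tokens."""
    counts = Counter()
    for token in TOKEN_RE.findall(text):
        counts.update(token)
    return counts


print(bigram_counts("Die Biene").most_common(1))   # [('ie', 2)]
print(unigram_counts("Die Biene").most_common(1))  # [('e', 3)]
```

Restricting bigrams to token-internal pairs reflects the handwriting perspective, where only letters that are actually joined within a word are of interest.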
DeReChar-v-XXX-YYY-2021-10-31-1.0 download:
Representation | XXX= | YYY=KJL | YYY=DRC |
---|---|---|---|
Synopsis | bi | | |
Visualisation | bi | | |
Total | bi | | |
Total | uni | | |
For handling the CSV files, please refer to the notes in the documentation. What happens when you click a link depends on the local settings of your browser and spreadsheet programme. In case of problems, save the file locally first, then start the application and import the file into the running application (by explicit import if available, otherwise via Open) with the options given in the documentation.
Corpus-based collections of typical word combinations
In addition to the rather cross-sectional, comprehensive range of typical word combinations for general language use offered via the co-occurrence database CCDB, this area of work is concerned with how subsets of typical word combinations can be compiled for particular sections of the language or from particular perspectives. A first test version, published under the name DeReKoll - Collocation Treasures for the German Reference Corpus, is available for the lemma list of the valency dictionary and offers various options for selecting typical word combinations of different quality. Further variants are in preparation.
Questions
If you have any questions or suggestions, please send an email to derewo(at)ids-mannheim.de.
Collaborations
- Tokyo University of Foreign Studies; Global COE Program Corpus-based Linguistics and Language Education (CbLLE)
- Interactions between Linguistic and Bioinformatic Procedures, Methods and Algorithms. Modelling and Presentation of Variance in Language and Genomes. Joint project within the framework of a BMBF funding priority.