Corpus Based Lemma and Word Form Lists

Methods of Corpus Analysis and Corpus Classification

DeReWo – Corpus-Based Lemma and Word Form Lists

In this subproject we develop methods for creating frequency-based ranked lists of lemmas and word forms on the basis of random virtual corpora. By applying these methods to the Mannheim German Reference Corpus DeReKo, we generate various lists of lemmas and word forms that reflect German language usage, for example the lemma candidate list with 350,000 entries for elexiko, the online dictionary of contemporary German.
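The basic idea of a frequency-based ranked list can be illustrated with a minimal Python sketch. It is not the DeReWo procedure: it assumes an already tokenised text, performs no lemmatisation, and does not draw virtual corpora. The function name `rank_word_forms` and the sample sentence are hypothetical.

```python
from collections import Counter

def rank_word_forms(tokens):
    """Return (rank, word_form, count) tuples, most frequent first."""
    counts = Counter(tokens)
    # Sort by descending count, then by the form itself so that ties
    # are resolved deterministically.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, form, count)
            for rank, (form, count) in enumerate(ordered, start=1)]

# Toy input; a real list would be derived from a large corpus sample.
sample = "der Hund und die Katze und der Vogel und".split()
for rank, form, count in rank_word_forms(sample):
    print(rank, form, count)
```

In this sketch, forms with equal counts receive consecutive ranks; a real list has to make and document its own choices about ties, spelling variants and lemma assignment.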

In addition to the various corpus-based word form and lemma lists that have been available since 2007, this research area is broadening its scope to include

- corpus-based character frequency lists,
- letter transition frequencies relevant to cursive writing, and
- corpus-based collections of typical word combinations.

Corpus-Based Lemma and Word Form Lists

Current Main Subjects

- spelling classification
- paradigmatic classification
- temporal, regional, text-typological and similar differentiation
- exceptions
- quality management

DeReWo Lemma and Word Form Lists Currently Available for Download

The Institute for the German Language regularly receives queries about the “most common German words”, on the assumption that such a request is unambiguous and therefore easy to answer. With the publication of the DeReWo lemma and word form lists, we try to strike a compromise between the fascinating diversity of linguistic reality and the legitimate desire for a description that is as compact as possible, even if it is partly simplifying.

The general annotations give an overview of the issues that are relevant to creating and using such lists and that we have had to address. They are included in each archive in the version that applies to it. You can download the current version directly here.

In addition to the general annotations, each DeReWo list comes with detailed product-specific documentation, whose structure follows that of the general annotations. It is intended to help users understand the particular view of the language that the list takes, the simplifications this entails, and the consequences for interpreting and using the list.

Name	Type	Number of Entries	Published on	Download
DeReKo-2014-II-MainArchive-STT.100000	Word Form + Lemma + POS Frequency List	100,000	December 31, 2014	download (Format zip) download (Format 7z)
derewo-v-ww-bll-320000g-2012-12-31-1.0	Lemma List	326,946	December 31, 2012	download
derewo-v-ww-bll-250000g-2011-12-31-0.1	Lemma List	250,000	December 31, 2011	download
derewo-v-40000g-2009-12-31-0.1	Lemma List	40,000	December 31, 2009	download
derewo-v-100000t-2009-04-30-0.1	Word Form List	100,000	May 12, 2009	download
derewo-v-30000g-2007-12-31-0.1	Lemma List	30,000	December 31, 2007	download

Using the DeReWo lists without having read the corresponding documentation is scientifically questionable.
Referencing or passing on the DeReWo lists without the corresponding documentation is not permitted.
Commercial use of DeReWo lists is prohibited.
If you have problems opening the downloaded lists, please proceed as follows:
- first, download the archive and save it locally
- then unpack the archive (usually possible by double-clicking); a new folder will be created
- start an application (word processor, spreadsheet or the like)
- load the list file (the one without a PDF file extension) from the new folder into the application
- if asked, select the encoding ISO-8859-15 (if necessary, look it up in the documentation)
- if this does not lead to the desired result, please send an email to the address listed below
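For readers who prefer to open an unpacked list file in a script rather than in an office application, the encoding step can be sketched in Python as follows. The sketch only assumes that the file is plain text in ISO-8859-15; the helper `read_list_lines`, the file name `demo-list.txt` and the sample content are hypothetical, and the actual column layout of each list is described in its documentation.

```python
import tempfile
from pathlib import Path

def read_list_lines(path, encoding="iso-8859-15"):
    """Read a text file in the given encoding and return its non-empty lines."""
    text = Path(path).read_text(encoding=encoding)
    return [line for line in text.splitlines() if line.strip()]

# Self-contained demonstration: write a small ISO-8859-15 file to a
# temporary folder and read it back. With a real download, pass the path
# of the unpacked list file instead.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "demo-list.txt"
    demo_file.write_bytes("Straße 12\n\nüber 7\n".encode("iso-8859-15"))
    lines = read_list_lines(demo_file)

print(lines)
```

Stating the encoding explicitly avoids garbled umlauts and ß on systems whose default encoding is UTF-8.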

If you have any questions or suggestions, please send an email to derewo(at)ids-mannheim.de.

Corpus-based character frequency lists

DeReChar-v-uni-XXX-2018-02-28-1.0

For various purposes it is useful to know how the frequencies of individual characters, in particular the letters of the German alphabet, are distributed in language use. To this end we carried out a series of analyses on our collection of authentic texts, the German Reference Corpus DeReKo; they are summarised in this documentation. The documentation describes the background and characteristics of the various lists "derechar-v-uni-XXX-2018-02-28-1.0" created in the study, which are listed in the overview below for reference and can be downloaded from their respective pages.

uniXXX =	all distinct characters	German alphabet only	German alphabet only
Relative frequency calculation	all distinct characters	upper/lower case distinguished	upper/lower case ignored
with „andere Zeichen“ (other characters)	...uni-204-a-c...	...uni-059-a-c...	...uni-030-a-l...
without „andere Zeichen“ (other characters)		...uni-059-b-c...	...uni-030-b-l...
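How the variants in this overview could differ can be illustrated with a small Python sketch of relative character frequencies. It rests on assumptions that are not taken from the DeReChar documentation: whitespace is skipped, the German alphabet is taken to be a to z plus ä, ö, ü and ß, and "other characters" are pooled under the hypothetical label `<other>` when requested. The function and parameter names are likewise hypothetical.

```python
from collections import Counter

# Assumption: the German alphabet in lower case, including umlauts and ß.
GERMAN_LETTERS = set("abcdefghijklmnopqrstuvwxyzäöüß")
OTHER = "<other>"  # hypothetical label for pooled non-alphabet characters

def char_relative_frequencies(text, ignore_case=False,
                              alphabet_only=False, pool_other=False):
    """Return a dict mapping each counted character to its relative frequency."""
    if ignore_case:
        text = text.lower()
    counts = Counter()
    for ch in text:
        if ch.isspace():
            continue  # assumption: whitespace is not counted
        if alphabet_only and ch.lower() not in GERMAN_LETTERS:
            if pool_other:
                counts[OTHER] += 1  # keep it in the denominator as one class
            continue
        counts[ch] += 1
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()} if total else {}

text = "Aa b!"
print(char_relative_frequencies(text))
print(char_relative_frequencies(text, ignore_case=True, alphabet_only=True))
print(char_relative_frequencies(text, ignore_case=True, alphabet_only=True,
                                pool_other=True))
```

The three calls show how the same text yields different relative frequencies depending on whether case is distinguished and whether non-alphabet characters enter the total, which is why the documentation of the respective list should be consulted before comparing values.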

Letter transition frequencies relevant to cursive writing

DeReChar-v-[bi|uni]-[KJL|DRC]-2021-10-31-1.0

Creating frequency lists for bigrams (here: sequences of two characters) with a scope and setup as comprehensive as those of the character frequency lists above is much more complex and yields few results that are manageable and insightful.

As part of a small study intended to help evaluate different approaches to teaching joined-up handwriting, we offer bigram frequency data here, together with the corresponding unigram frequency data. This information is only meaningful against the background of that study and with knowledge of the documentation mentioned below.

The evaluation is based on a restricted notion of tokens and bigrams and focuses on KJL, the DeReKo corpus of children's and youth literature, a dataset chosen to approximate the target group of the research question as closely as possible. The same evaluations were also carried out on the current version of the dataset underlying the DeReChar study mentioned above (DRC). Because of the restricted notion of tokens and bigrams, however, those results appear useful only for comparison with the bigram representations.
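Counting letter transitions can be sketched in a few lines of Python. The sketch assumes, purely for illustration, that a token is a maximal run of letters and that bigrams are counted only inside tokens, never across token boundaries; the restricted notion of tokens and bigrams actually used in the study is defined in its documentation and may differ. The function name `count_bigrams` is hypothetical.

```python
import re
from collections import Counter

def count_bigrams(text):
    """Count two-character sequences and single characters inside tokens."""
    bigrams = Counter()
    unigrams = Counter()
    # Assumption: a token is a maximal run of letters (no digits,
    # punctuation or underscores).
    for token in re.findall(r"[^\W\d_]+", text):
        unigrams.update(token)
        # Pair each character with its successor within the same token.
        bigrams.update(a + b for a, b in zip(token, token[1:]))
    return bigrams, unigrams

bigrams, unigrams = count_bigrams("an Anna kann")
print(bigrams.most_common())
print(unigrams.most_common())
```

Case is distinguished here, so "An" and "an" are separate transitions; whether that is appropriate depends on the question being asked.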

The results are presented in three ways for both datasets: (1) as a synopsis focused on the essential findings, in particular on bigrams grouped by given categories; (2) as a visualisation showing only the bigrams arranged by predefined categories; and (3) as quantitative overviews of all bigram and unigram frequencies.

For details of the subdivisions of the representations, the frequency information, the file types and the options for further processing, please read the detailed documentation; please also note the licence notice it contains.

DeReChar-v-XXX-YYY-2021-10-31-1.0 download:

Representation	XXX	YYY = KJL	YYY = DRC
Synopsis	bi	…v-bi-KJL…txt	…v-bi-DRC…txt
Visualisation	bi	…v-bi-KJL…html	…v-bi-DRC…html
Total	bi	…v-bi-KJL…csv	…v-bi-DRC…csv
Total	uni	…v-uni-KJL…csv	…v-uni-DRC…csv

For handling the csv files, please refer to the notes in the documentation. What happens when you click a link depends on the local settings of your browser and spreadsheet programme. In case of problems, first save the file locally, then start the application and import the file into the running application (by explicit import if available, otherwise via Open) using the options given in the documentation.

Corpus-based collections of typical word combinations

The co-occurrence database CCDB already offers a broad, cross-sectional collection of typical word combinations in general language use. In this field of work we additionally consider how subsets of typical word combinations can be compiled for particular sections of the language or from particular perspectives. A first test version is available for the lemmas covered by the valency dictionary, with various options for selecting typical word combinations of different quality. It has been published under the name DeReKoll - Collocation Treasures for the German Reference Corpus. Further variants are in preparation.

Questions

If you have any questions or suggestions, please send an email to derewo(at)ids-mannheim.de.

Collaborations

Tokyo University of Foreign Studies; Global COE Program Corpus-based Linguistics and Language Education (CbLLE)
Interactions between Linguistic and Bioinformatic Procedures, Methods and Algorithms. Modelling and Presentation of Variance in Language and Genomes. Joint project within the framework of a BMBF funding priority.
