Methods of Corpus Analysis and Corpus Classification
Lemmatization
Lemmatization assigns inflected forms, compounds and/or other morphological forms to their lemmata.
In this context, lemmata are
- uninflected simplicia of different parts of speech,
- uninflected derivatives and compounds,
- derivational morphemes.
The lemmatization method Inflexion Analysis and Decomposition of Compounds was developed by Cyril Belica in 1994 (Cyril Belica: WP2 - Lemmatizer. Final Report. MLAP93-21 MECOLB, Deliverable D5. Luxembourg, July 1994) and has been deployed since then as a module of the COSMAS system (see also Conceptual Development of the COSMAS-Platform).
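The two steps named in the method's title, inflexion analysis and decomposition of compounds, can be illustrated with a minimal sketch. It is not the COSMAS lemmatizer and does not reproduce its lexicons or rules: the tiny lexicon, the suffix list, the list of linking elements and the function names strip_inflection and decompose are all hypothetical and chosen only for illustration.

```python
# Toy lexicon of lemmata (hypothetical; not taken from the COSMAS lexicons).
LEXICON = {"haus", "tür", "kind", "garten"}
# Hypothetical inflectional suffixes tried during inflexion analysis.
SUFFIXES = ("es", "en", "er", "e", "s", "n")
# Hypothetical linking elements allowed between the parts of a compound.
LINKERS = ("", "s", "es", "n", "en", "er")


def strip_inflection(form):
    """Return the lexicon lemma for an inflected form, or None if not found."""
    form = form.lower()
    if form in LEXICON:
        return form
    for suffix in SUFFIXES:
        if form.endswith(suffix) and form[:-len(suffix)] in LEXICON:
            return form[:-len(suffix)]
    return None


def decompose(form):
    """Split a form into a list of lexicon lemmata, or return None."""
    form = form.lower()
    lemma = strip_inflection(form)
    if lemma is not None:
        return [lemma]
    # Try every prefix that is a known lemma, then skip an optional
    # linking element and analyse the remainder recursively.
    for i in range(1, len(form)):
        head = form[:i]
        if head not in LEXICON:
            continue
        rest = form[i:]
        for linker in LINKERS:
            if rest.startswith(linker):
                tail = decompose(rest[len(linker):])
                if tail is not None:
                    return [head] + tail
    return None


print(strip_inflection("Hauses"))
print(decompose("Haustüren"))
print(decompose("Kindergarten"))
```

In this sketch "Hauses" is reduced to "haus", "Haustüren" is split into "haus" and "tür", and "Kindergarten" into "kind" and "garten". A real system needs far larger lexicons and rules for phenomena such as umlaut alternation and ambiguous segmentations, which this sketch ignores.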
In this subproject, the program system is to be developed further with a view to optimising the corpus-based inventory of the lexicon. The plan is to systematize and complete the underlying electronic lexicons and the system of rules for morphological analysis and, depending on the capacity available, possibly also to extend the functionality to cover the new spelling rules, the spoken language and the historical depth of the word inventory to be lemmatized.
Contact:
Cyril Belica <belica@ids-...>