Methods of Corpus Analysis and Corpus Classification


Lemmatization assigns inflected forms, word formations and/or other morphological forms to their lemmata. In this context, lemmata are
  • uninflected simplicia of different parts of speech,
  • uninflected derivations and compounds,
  • derivational morphemes.
The lemmatization method Inflexion Analysis and Decomposition of Compounds was developed by Cyril Belica in 1994 (Cyril Belica: WP2 - Lemmatizer. Final Report. MLAP93-21 MECOLB, Deliverable D5. Luxembourg, July 1994) and has been deployed since then as a module of the COSMAS system (see also Conceptual Development of the COSMAS-Platform). In this subproject the program system is to be developed further with a view to optimising the corpus-based inventory of the lexicon. The plan is to systematize and complete the underlying electronic lexicons and the system of rules for morphological analysis. Depending on the capacity available, the functionality may also be extended to cover the new spelling rules, spoken language and the historical depth of the word inventory to be lemmatized.
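
The two steps named in the method's title, inflexion analysis and decomposition of compounds, can be illustrated with a deliberately small sketch. It is not the COSMAS lemmatizer and makes no claim about how that module works internally. The lexicon, the list of endings, the list of linking elements and the function names analyse_inflexion and decompose are all assumptions chosen for illustration, and phenomena such as umlaut alternation and derivational morphemes are left out.

```python
# Toy lexicon of lemmata, keyed by lowercase form (hypothetical data).
LEXICON = {w.lower(): w for w in ("Haus", "Tür", "Kind", "Garten")}

# Hypothetical inflectional endings; the empty ending is tried first
# so that a form that is already a lemma is returned unchanged.
ENDINGS = ("", "ern", "en", "er", "es", "e", "n", "s")

# Hypothetical linking elements that may join the parts of a compound.
LINKERS = ("", "s", "es", "n", "en", "er")


def analyse_inflexion(form):
    """Return the lemma of a simplex word form, or None if none is found."""
    low = form.lower()
    for ending in ENDINGS:
        stem = low[: len(low) - len(ending)]
        if low.endswith(ending) and stem in LEXICON:
            return LEXICON[stem]
    return None


def decompose(form):
    """Split a word form into a list of lemmata; [] if no analysis is found."""
    low = form.lower()
    lemma = analyse_inflexion(low)
    if lemma is not None:
        return [lemma]
    # Try every prefix that is a known lemma, then an optional linker,
    # then analyse the remainder recursively.
    for i in range(1, len(low)):
        head = low[:i]
        if head not in LEXICON:
            continue
        rest = low[i:]
        for linker in LINKERS:
            if rest.startswith(linker):
                tail = decompose(rest[len(linker):])
                if tail:
                    return [LEXICON[head]] + tail
    return []


if __name__ == "__main__":
    for word in ("Kindern", "Haustüren", "Kindergarten"):
        print(word, decompose(word))
```

In this sketch an inflected simplex such as "Kindern" is reduced to a single lemma, while a compound such as "Haustüren" is split into its parts, with the inflectional ending stripped from the last part. A rule-based system working with full electronic lexicons would additionally have to handle ambiguous analyses, which the sketch ignores by returning the first match.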