Methods of Corpus Analysis and Corpus Classification

Lemmatization

Lemmatization assigns inflected forms, compounds, and other morphologically complex forms to their lemmata.

In this context, lemmata are:

  • uninflected simplicia of different parts of speech,
  • uninflected derivations and compounds,
  • derivational morphemes.

The lemmatization method "Inflexion Analysis and Decomposition of Compounds" was developed by Cyril Belica in 1994 (Cyril Belica: WP2 - Lemmatizer. Final Report. MLAP93-21 MECOLB, Deliverable D5. Luxembourg, July 1994) and has been deployed since then as a module of the COSMAS system (see also Conceptual Development of the COSMAS-Platform).
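The general idea of combining inflexion analysis with compound decomposition can be illustrated with a minimal sketch. It is not a reconstruction of the COSMAS module: the lexicon, the suffix list, the linking elements, and the function names strip_inflection and decompose are all hypothetical and chosen only for illustration. A real lemmatizer needs a far larger lexicon and rule system and has to handle cases this sketch ignores, such as umlaut alternation.

```python
# Toy lexicon of known lemmata (hypothetical, for illustration only).
LEXICON = {"haus", "tür", "kind", "garten", "buch", "laden"}

# Toy inflectional suffixes, longest first (assumed, far from complete).
SUFFIXES = ["ern", "en", "er", "es", "e", "n", "s"]

# Linking elements that may occur between compound parts (assumed subset).
LINKERS = ["", "s", "es", "n", "en", "er"]


def strip_inflection(word):
    """Return the lemma of an inflected simplex, or None if it is unknown."""
    word = word.lower()
    if word in LEXICON:
        return word
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if stem in LEXICON:
                return stem
    return None


def decompose(word):
    """Return the list of lemmata a word consists of, or None if unanalysable."""
    word = word.lower()
    lemma = strip_inflection(word)
    if lemma is not None:
        return [lemma]
    # Try each split point: the left part must be a known lemma, optionally
    # followed by a linking element; the remainder is analysed recursively.
    for i in range(2, len(word) - 1):
        left = word[:i]
        if left not in LEXICON:
            continue
        rest = word[i:]
        for linker in LINKERS:
            if rest.startswith(linker):
                tail = decompose(rest[len(linker):])
                if tail is not None:
                    return [left] + tail
    return None


if __name__ == "__main__":
    for form in ["Hauses", "Haustüren", "Kindergarten", "Buchladens"]:
        print(form, "->", decompose(form))
```

In this sketch the inflected compound "Haustüren" is split into "haus" and "tür", and the linking element "er" in "Kindergarten" is skipped so that the word is assigned to "kind" and "garten". Forms that cannot be traced back to lexicon entries yield None rather than a guess.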

In this subproject, the software system is to be developed further with a view to optimizing the corpus-based inventory of the lexicon. The plan is to systematize and complete the underlying electronic lexicons and the rule system for morphological analysis. Depending on the capacity available, the functionality may also be extended to cover the new spelling rules, spoken language, and the historical depth of the word inventory to be lemmatized.

Contact:

Cyril Belica <belica@ids-...>