Linguistics Research Colloquium:

Idioms in Woordcombinaties: how to serve both lexicography and NLP

Wednesday, 3 July 2024, 16:00 to 17:30
IDS lecture hall, R 5

Talk by Carole Tiberius and Lut Colman
of the Instituut voor de Nederlandse Taal (Leiden, Netherlands)

Mastering idiomatic language in its broadest sense is necessary to achieve advanced levels in language learning and performance. Therefore, phraseological information should be quickly and easily available to language learners and native speakers. To this end, the Dutch project Woordcombinaties (Word Combinations) is developing a corpus-based integrated lexicographic resource combining a collocation and idiom dictionary with a pattern dictionary. It merges a word-in-context and collocation tool, following the example of Sketch Engine for language learning (SKELL), and a pattern dictionary, following examples such as the Pattern Dictionary of English Verbs (PDEV) and Typed Predicate Argument Structures for Italian (T-PAS).

Idioms and conversational routines are also included in Woordcombinaties. Currently they are encoded as special instances among the collocations and the patterns. However, separate access with specific search options for idioms and conversational routines is planned and currently being designed. For instance, it will be possible to search for idioms by image category, such as ‘body parts’ and ‘food’ for een vinger in de pap hebben ‘have a finger in the pie’, and by less specific sense category, such as ‘have a property’. Conversational routines will be linked to speech acts, such as ‘greeting’ or ‘apologizing’.

Particularly challenging is determining the canonical forms of multi-word expressions, if any, and their lemma forms (see also ongoing discussions in the context of UniDive[1]). Canonical forms and lemma forms should preferably be suited for both lexicographic products and NLP purposes. Should the lemma be reduced as far as possible, to only the fixed lexicalised components? Or is there a case for including open slots, and if so, to what extent? Can we have it both ways?

We will describe the approach developed for Woordcombinaties and the changes required to our workflow and tools.


[1] COST Action on Universality, Diversity and Idiosyncrasy in Language Technology. Within UniDive there is a separate task on harmonising lemmatisation rules (for words and MWEs) and lexical features across languages.