Methods of Corpus Analysis and Corpus Classification
Multidimensional Corpus Analyses
In this subproject we explore methods for examining whether a given linguistic phenomenon shows a noticeable frequency distribution that could be relevant to a given linguistic question. The distribution can be examined along dimensions such as time, genre, topic or style. We understand linguistic phenomena as all objects that occur in a given linguistic sample and can in principle be quantified, ranging from single words through complex expressions to abstract syntactic structures or communication events.
Results (Selection)
- Visualisation of temporal progressions in a semantic context
- Systematic Generation of Time Behaviour Charts for the Online Dictionary of Neologisms of the IDS project Neuer Wortschatz (see also Online Documentation for OWID)
- based on the diachronic frequency distribution of a word, various formal measures quantify the confidence that it qualifies as a neologism candidate
- various filters that screen out known groups of obvious non-neologisms (regionalisms, proper names, editorial abbreviations, ...)
- an empirically constructed typology of the diachronic frequency distribution of verified neologisms
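The idea of quantifying, from a word's diachronic frequency distribution, how plausible it is as a neologism candidate can be illustrated with a minimal sketch. The measure used here (the share of a word's relative frequency that falls into the later part of the observation period) and the names `per_million`, `late_share`, `is_candidate`, `min_total` and `min_late_share` are hypothetical choices made for illustration only; they are not the formal measures developed in the project.

```python
from typing import Sequence


def per_million(counts: Sequence[int], corpus_sizes: Sequence[int]) -> list[float]:
    """Occurrences per million tokens for each time slice."""
    if len(counts) != len(corpus_sizes):
        raise ValueError("counts and corpus_sizes must have the same length")
    return [1e6 * c / n if n else 0.0 for c, n in zip(counts, corpus_sizes)]


def late_share(rel_freqs: Sequence[float], split: int) -> float:
    """Fraction of the summed relative frequency at or after index `split`."""
    total = sum(rel_freqs)
    if total == 0:
        return 0.0
    return sum(rel_freqs[split:]) / total


def is_candidate(
    counts: Sequence[int],
    corpus_sizes: Sequence[int],
    split: int,
    min_total: int = 20,          # assumed threshold: ignore very rare words
    min_late_share: float = 0.95,  # assumed threshold: almost all usage is recent
) -> bool:
    """Flag a word whose usage is concentrated in the recent time slices."""
    rel = per_million(counts, corpus_sizes)
    return sum(counts) >= min_total and late_share(rel, split) >= min_late_share


# Invented yearly counts for two words in six equally sized time slices.
sizes = [1_000_000] * 6
new_word = [0, 0, 0, 1, 10, 30]
old_word = [50, 52, 49, 51, 50, 48]
print(is_candidate(new_word, sizes, split=3))
print(is_candidate(old_word, sizes, split=3))
```

A single threshold like this would flag many false positives, which is why filters for known groups of non-neologisms and a typology of verified distributions are needed in addition.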
Relevant Research Aspects
- typology of possible dimensions (linear order, hierarchical structure, unstructured)
- universal and dimension-specific analysis methods
- unidimensional and multidimensional analysis
- handling of epiphenomena / artefacts (base-frequency effects, text-length effects, saturation effects)
- exploration and evaluation in specific linguistic application scenarios
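The difference between unidimensional and multidimensional analysis, and the need to control for text length, can be sketched as follows. The example aggregates hits and token counts per combination of dimension values and reports hits per million tokens, so that larger texts or larger corpus slices do not inflate the figures. The record type `Sample`, the function `relative_frequencies` and all data are hypothetical and serve only as an illustration under these assumptions.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class Sample(NamedTuple):
    year: int
    genre: str
    hits: int    # occurrences of the phenomenon in this text
    tokens: int  # length of the text in tokens


def relative_frequencies(
    samples: Iterable[Sample], dims: tuple[str, ...]
) -> dict[tuple, float]:
    """Hits per million tokens for each combination of values of `dims`."""
    hits: dict[tuple, int] = defaultdict(int)
    tokens: dict[tuple, int] = defaultdict(int)
    for s in samples:
        key = tuple(getattr(s, d) for d in dims)
        hits[key] += s.hits
        tokens[key] += s.tokens
    # Normalise by the token total of each cell to remove text-length effects.
    return {k: 1e6 * hits[k] / tokens[k] for k in tokens if tokens[k] > 0}


# Invented data: four texts from two years and two genres.
samples = [
    Sample(2000, "news", 2, 1_000_000),
    Sample(2000, "fiction", 0, 500_000),
    Sample(2010, "news", 8, 2_000_000),
    Sample(2010, "fiction", 1, 500_000),
]

by_year = relative_frequencies(samples, ("year",))                  # one dimension
by_year_genre = relative_frequencies(samples, ("year", "genre"))    # two dimensions
print(by_year)
print(by_year_genre)
```

Comparing the two results shows whether a change over time holds across genres or is driven by a single one.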
Current Main Subjects
Current research concentrates on the dimension of time. This work has produced methods for the automatic detection of neologism candidates, that is, words whose diachronic frequency distribution is typical of neologisms. In collaboration with the in-house project Lexical Innovations, these methods are continuously evaluated and developed further.
Publications (Selection)
- Fankhauser, Peter / Kupietz, Marc (2017): Visualizing Language Change in a Corpus of Contemporary German (Corpus Linguistics Conference, Birmingham).
- Keibel, Holger (2009): Mathematische Häufigkeitsmaße in der Korpuslinguistik: Eigenschaften und Verwendung. (Erw. und überarb. 2. Aufl.). Mannheim: Institut für Deutsche Sprache.
- Keibel, Holger / Hennig, Sophie / Perkuhn, Rainer (2011): Effiziente halbautomatische Detektion von Neologismuskandidaten. Technical Report IDS-KL-2010-01. Mannheim: Institut für Deutsche Sprache.
Contact: Dr. Harald Lüngen <luengen@ids-...>