Corpus Linguistics Programme Area
What is Corpus Linguistics?
The aim of corpus linguistics is to gain new insights into the structure, principles, features and functions of language through exploratory analysis of very large collections of naturally occurring language data.
At the IDS, the Corpus Linguistics programme area formulates methodological research aims centred on advancing corpus-driven, structure-discovering methods of analysis. These methods take up various fundamental questions of descriptive linguistics while keeping the theoretical background in mind. By systematically generalising the insights gained in this way, the programme area seeks to evaluate existing linguistic hypotheses and formal models and to formulate new ones.
The results of communication processes recorded in corpora serve as the empirical foundation both for exploratory analysis and for the inductive strategy of generalisation, which aims at theory formation.
Although this approach sets out from lexical units and their contexts, the lexical, syntactic and semantic levels are not treated as separate from each other here: the term collocation, extended by the terms variance and polynominality and operationalised by applying mathematical-statistical, pattern-oriented methods to empirical language data, plays a fundamental role in the lexicon-syntax continuum postulated here. The purpose of this approach is to uncover the laws governing preference relations, which are characterised, among other things, by the fact that they do not vary primarily in a rule-based manner but depend on pragmatic, linguistic and extralinguistic factors. Moreover, subtle linguistic structures that are not accessible to the linguistic intuition of individual language users can only be traced and detected by analysing large amounts of data.
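To make the idea of operationalising collocation statistically more concrete, here is a minimal sketch. The text above does not name a specific association measure, so pointwise mutual information (PMI) is used purely as a stand-in; it is not presented as the method used at the IDS. The function name `pmi_scores`, its parameters `window` and `min_count`, and the toy token list are all hypothetical and chosen for illustration only.

```python
import math
from collections import Counter


def pmi_scores(tokens, window=2, min_count=2):
    """Score ordered word pairs that co-occur within `window` tokens.

    Illustrative assumption: PMI stands in for whatever association
    measure a real collocation analysis would use.
    """
    unigrams = Counter(tokens)
    pairs = Counter()
    # Count each word together with the words that follow it inside the window.
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + 1 + window]:
            pairs[(left, right)] += 1

    total_tokens = len(tokens)
    total_pairs = sum(pairs.values())
    scores = {}
    for (left, right), n in pairs.items():
        if n < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_pair = n / total_pairs
        p_left = unigrams[left] / total_tokens
        p_right = unigrams[right] / total_tokens
        # PMI compares observed co-occurrence with chance co-occurrence.
        scores[(left, right)] = math.log2(p_pair / (p_left * p_right))
    return scores


# Toy data (hypothetical); a real analysis would use a very large corpus.
example_tokens = "strong tea strong tea weak coffee strong tea".split()
scores = pmi_scores(example_tokens, window=1, min_count=2)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 3))
```

A positive score means the pair occurs together more often than its individual frequencies would suggest. On corpora of realistic size, measures such as the log-likelihood ratio are often preferred because PMI tends to overrate rare pairs; the `min_count` threshold is only a crude guard against that.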
- The programme area is responsible for the continuous sampling of contemporary written German language usage in the Mannheim German Reference Corpus.
- Starting from fundamental considerations on linguistic theory formation, a methodology is developed, based on mathematical-statistical methods and techniques for their interpretation.
- The generalisations obtained from this methodological research are reflected upon at a scientific level and fed into the discussion of linguistic theory formation.
- Furthermore, the methodology developed is introduced into linguistic research and often applied in collaboration with other, sometimes external, projects in order to identify and describe multiword expressions and other linguistic structures based on preference relations.
DeReWo – Corpus-Based Word Lists for DeReKo (pdf, 288K, German)
The IDS Text Model (pdf, 212K, German)
(Near) Duplicate Detection in the IDS Corpora (pdf, 768K, English)
Thematic Indexing of Corpora (pdf, 220K, German)
MDCA – Methodology of Multidimensional Corpus Analyses (pdf, 386K, German)
CCDB – A Corpus-Linguistic Research and Development Workbench (pdf, 432K, English)
VICOMTE – Systematic Exploration of Collocation Profiles (pdf, 1.2M, English)
CNS – Contrasting Near-Synonyms (pdf, 316K, German)
- Dr. Marc Kupietz (Head)
- Franck Bodmer
- Nils Diewald
- Dr. Peter Fankhauser
- Peter M. Fischer
- Dr. Harald Lüngen
- Eliza Margaretha
- Rainer Perkuhn
- Ines Pisetta
- Helge Stallkamp
- Rameela Yaddehige
- Nicolas Arnold