KorAP – Next Generation Corpus Analysis Platform

KorAP is a new corpus analysis platform, optimized for large, multiply annotated corpora and complex search mechanisms. Here you can try out the search in DeReKo.

Background

Systematically compiled electronic collections of recorded acts of communication, so-called corpora, are now the most important empirical basis of linguistics. They are used to confirm or refute hypotheses and also serve as a direct object of exploratory research. Suitable tools are indispensable, especially for making large corpora manageable for linguists. These tools must be capable of managing very large amounts of data without loss and of offering CPU-intensive functions for methodologically valid analysis.

The world's largest collections of German linguistic data are held at the Institute for the German Language (IDS) in the Archive for Spoken German (AGD) and the Mannheim German Reference Corpus (DeReKo). To provide access to the latter in particular, the Corpus Search, Management and Analysis Systems (COSMAS I and COSMAS II) were created at the IDS. These systems have proven successful in continuous operation since 1991 and 2003 respectively. Because COSMAS II was designed in the early 1990s, and because the effort needed to extend such software grows disproportionately with its age and complexity, it is becoming more and more challenging to adapt the software to rapidly changing requirements. Meanwhile, both the technical and the scientific conditions have changed so drastically that the development of a new type of analysis tool is now desirable.

Therefore, the aim of the KorAP project is to develop a novel corpus analysis platform that can handle very large corpora in a methodologically valid way and that provides a foundation for empirical linguistic research on German.

New Challenges

New trends have emerged within linguistics in recent years. They will inevitably lead to adjustments of the methods and tools currently applied in research. The resounding success and progressive spread of e-Science within the humanities (“e-Humanities”) have been accompanied by intensified “empiricalisation” or “scientification”. Not only is research data growing in importance; linguists are also paying increased attention to the operability and applicability of scientific maxims such as falsifiability and replicability. In the course of these trends, researchers are increasingly anxious to keep their findings persistently usable in spite of growing amounts of data and highly dynamic corpora, which shows in the increased demand for verifiable and replicable results. Scientists' strong wish to work collaboratively on data, and to use their research software regardless of location, currently manifests itself in the emergence of distributed research infrastructures such as CLARIN. Interfaces to such distributed infrastructures therefore have to be provided. On this basis, federated search and analysis, reusable definitions of distributed virtual corpora and of search and analysis schemes, and the feeding back of user annotations can be realized.

Qualitatively new challenges emerge from the immense growth of corpora in general and of DeReKo in particular. While until recently the data-driven analysis paradigm was relevant mainly for lexicology, today more complex language patterns and structures can be detected on the basis of very large samples. They can also be analysed in relation to other factors, such as time and origin. This is reflected not only in current trends in grammar research, such as the new conference series Grammar and Corpora, but also in linguistic theory formation in general, as shown by new journals like Corpus Linguistics and Linguistic Theory. The consequences mainly concern the scientifically reliable support of new research methods, but they also make it necessary to implement sophisticated multidimensional analysis methods. This applies specifically to strategies that combine different approaches and work in both a data-driven and a hypothesis-based way. Such strategies can also operate in the same way on primary data and on (interpretive) secondary data such as automatically generated linguistic annotations.

Apart from methodological challenges, corpus linguistics is increasingly faced with the requirement to be able to handle linguistic data of varied modality. Multi-modal resources such as digital recordings of spoken language have become an integral part of source material in research and likewise need systematic processing using established methods.

KorAP from the user's point of view

The new corpus analysis platform will take over all the features of the IDS corpus research tools COSMAS I and COSMAS II and thus continue to support the proven and valued functionalities of its predecessors. For more information, see COSMAS I and COSMAS II.

In addition, the system under development will offer its users many new useful and attractive functionalities. Planned features include, among others:

  •     Enhanced options for composing virtual corpora through the inclusion of metadata and textual content properties
  •     Improved query language through the use of regular expressions
  •     Extended query syntax for straightforward searches in multi-layer annotations (multi-layer queries)
  •     Graphical display of search results
  •     Multiple options for sorting hits
  •     Even faster processing of even larger amounts of data
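
To make the planned features more concrete, the following minimal Python sketch shows how metadata-based virtual corpus composition, a regular-expression query and a query across two annotation layers might fit together. It is not KorAP's query language or API: all names (Token, Text, virtual_corpus, multi_layer_query), the metadata fields and the sample data are hypothetical and chosen purely for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Token:
    surface: str  # word form as it appears in the text
    lemma: str    # hypothetical annotation layer: lemma
    pos: str      # hypothetical annotation layer: part-of-speech tag


@dataclass
class Text:
    text_id: str
    year: int     # hypothetical metadata field
    genre: str    # hypothetical metadata field
    tokens: list


def virtual_corpus(texts, min_year, genre):
    """Compose a virtual corpus by selecting texts via their metadata."""
    return [t for t in texts if t.year >= min_year and t.genre == genre]


def multi_layer_query(texts, surface_pattern, pos):
    """Find tokens whose surface form fully matches a regular expression
    and whose part-of-speech annotation equals the given tag."""
    rx = re.compile(surface_pattern)
    hits = []
    for text in texts:
        for position, tok in enumerate(text.tokens):
            if rx.fullmatch(tok.surface) and tok.pos == pos:
                hits.append((text.text_id, position, tok.surface))
    return hits


# Invented sample data, for illustration only.
texts = [
    Text("A01", 1995, "news", [Token("Die", "die", "ART"),
                               Token("Häuser", "Haus", "NN"),
                               Token("stehen", "stehen", "VVFIN")]),
    Text("B02", 2010, "news", [Token("Ein", "ein", "ART"),
                               Token("Haus", "Haus", "NN"),
                               Token("hausen", "hausen", "VVINF")]),
    Text("C03", 2012, "fiction", [Token("Hausboot", "Hausboot", "NN")]),
]

# Restrict the search to news texts from 2000 onwards.
corpus = virtual_corpus(texts, min_year=2000, genre="news")

# Nouns whose surface form starts with "Haus" or "haus".
hits = multi_layer_query(corpus, r"[Hh]aus.*", pos="NN")

# One possible sorting option: alphabetically by surface form.
hits_sorted = sorted(hits, key=lambda h: h[2].lower())
print(hits_sorted)
```

In this sketch the verb "hausen" matches the regular expression but is excluded by the part-of-speech condition, which is the point of querying several annotation layers at once. A real platform would rely on indexed data structures rather than a linear scan over tokens.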

Staff

Contact:
<korap@ids-...>
Staff members:
Former staff members:
Dr. Piotr Bański <banski@ids-...>
Elena Frick <frick@ids-...>
Michael Hanl
Carsten Schnober