Program Area "Oral Corpora"

Head of program area: Dr. Henrike Helmer

Deputy head of program area: Dr. Silke Reineke

The program area "Oral Corpora" maintains and extends the oral corpora in the Archive for Spoken German (AGD), develops and disseminates methods and technology for compiling and working with oral corpora, and contributes to initiatives for establishing standards and good practices in the area of oral corpora.

Maintenance and development of oral corpora

The Archive for Spoken German (AGD) collects and archives data of spoken German in interactions (interaction corpora) and data of domestic and non-domestic varieties of German (variation corpora). The corpora are curated in the archive and made available to the scientific public. [AGD-Flyer]

With the Research and Teaching Corpus of Spoken German (FOLK), the program area is building up a large German interaction corpus, which is made available through the Database of Spoken German.


The Conversation Analytic Information System (GAIS) provides up-to-date information on conversation analysis and related disciplines. It is designed to become an online handbook covering all elements of a typical empirical and corpus-based workflow in conversation analysis.

Project Corpus Technology

The Database of Spoken German (DGD) is a platform for accessing the oral corpora of the AGD. The DGD provides researchers, teachers and students with a web-based source for browsing and querying FOLK and other AGD data. [DGD-Flyer]

For the FOLK project, we developed the transcription editor FOLKER and the annotation tool OrthoNormal, both of which are also made available to the scientific public. [FOLKER-Flyer]

In collaboration with the Hamburg Centre for Language Corpora, the program area also contributed to the development of EXMARaLDA, a system for compiling, managing and analyzing oral corpora, and provides support for it. [EXMARaLDA-Flyer]

External projects

Featured recent publications in English:

  • Schmidt, Thomas (2016): Construction and Dissemination of a Corpus of Spoken Interaction - Tools and Workflows in the FOLK project. In: Corpus Linguistic Software Tools, Journal for Language Technology and Computational Linguistics (JLCL 31/1), by Kupietz, Marc & Geyken, Alexander (eds.), pp. 127-154. PDF
  • Westpfahl, Swantje / Schmidt, Thomas (2016): FOLK-Gold – A GOLD standard for Part-of-Speech-Tagging of Spoken German. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), Portorož, Slovenia. Paris: European Language Resources Association (ELRA), pp. 1493-1499. PDF
  • Schmidt, Thomas (2016): Good practices in the compilation of FOLK, the Research and Teaching Corpus of Spoken German. In: Compilation, transcription, markup and annotation of spoken corpora, by Kirk, John M. and Gisle Andersen (eds.), Special Issue of the International Journal of Corpus Linguistics [IJCL 21:3], pp. 396-418.
  • Ruhi, Şükriye / Haugh, Michael / Schmidt, Thomas / Wörner, Kai (eds.) (2014): Best Practices for Spoken Corpora in Linguistic Research. Newcastle: Cambridge Scholars Publishing.
  • Schmidt, Thomas (2014): The Database for Spoken German - DGD2. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland: European Language Resources Association (ELRA), pp. 1451-1457. PDF
  • Schmidt, Thomas (2014): The Research and Teaching Corpus of Spoken German - FOLK. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland: European Language Resources Association (ELRA), pp. 383-387. PDF
  • Schmidt, Thomas / Wörner, Kai (2014): EXMARaLDA. In: Jacques Durand, Ulrike Gut, and Gjert Kristoffersen (eds.): The Oxford Handbook of Corpus Phonology. Oxford: OUP, pp. 402-419.
  • Schmidt, Thomas (2012): EXMARaLDA and the FOLK tools. Two toolsets for transcribing and annotating spoken language. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey: European Language Resources Association (ELRA), pp. 236-240.
  • Schmidt, Thomas / Wörner, Kai (eds.) (2012): Multilingual Corpora and Multilingual Corpus Analysis (= Hamburg Studies on Multilingualism 14). Amsterdam: Benjamins.
  • Schmidt, Thomas (2011): A TEI-based approach to standardising spoken language transcription. In: Journal of the Text Encoding Initiative 1/2011.

Project Corpus Technology

With the Database of Spoken German (DGD), a platform for accessing the oral corpora of the AGD is being developed. The DGD enables users in research and teaching to browse and query selected parts of the AGD holdings via the web. [DGD-Flyer]

For the work on FOLK, the transcription editor FOLKER and the annotation tool OrthoNormal were developed, both of which are also made available to the scientific public. [FOLKER-Flyer]

The program area was also responsible for the maintenance, further development and support of EXMARaLDA, a system for compiling, managing and analyzing oral corpora. [EXMARaLDA-Flyer]