Program Area "Oral Corpora"
Head of program area: Dr. Henrike Helmer
Deputy head of program area: Dr. Silke Reineke
The program area "Oral Corpora" is concerned with maintaining and extending the oral corpora in the Archive for Spoken German (AGD), with developing and mediating methods and technology for compiling and working with oral corpora, and with contributing to initiatives for the establishment of standards and good practices in the area of oral corpora.
Maintenance and development of oral corpora
The Archive for Spoken German (AGD) collects and archives data of spoken German in interactions (interaction corpora) and data of domestic and non-domestic varieties of German (variation corpora). The corpora are curated in the archive and made available to the scientific public. [AGD-Flyer]
With the Research and Teaching Corpus of Spoken German (FOLK), the program area is building up a large German interaction corpus, which is made available through the Database of Spoken German.
Methods
The Conversation Analytic Information System (GAIS) provides up-to-date information on conversation analysis and related disciplines. It is designed to become an online handbook covering all elements of a typical empirical and corpus-based workflow in conversation analysis.
Project Corpus Technology
The Database of Spoken German (DGD) is a platform for accessing the oral corpora of the AGD. The DGD provides researchers, teachers and students with a web-based source for browsing and querying FOLK and other AGD data. [DGD-Flyer]
For the FOLK project, we developed the transcription editor FOLKER and the annotation tool OrthoNormal, both of which are also made available to the scientific public. [FOLKER-Flyer]
In collaboration with the Hamburg Centre for Language Corpora, the program area also contributed to the development of EXMARaLDA, a system for compiling, managing and analyzing oral corpora, and provides support for it. [EXMARaLDA-Flyer]
External projects
- The project "Accessing multimodal spoken language corpora: cross-linking and user-group specific differentiation" (ZuMult) is funded in the LIS program of the DFG. ZuMult is a cooperation between the Program Area "Oral Corpora", the Hamburg Centre for Language Corpora (University of Hamburg) and the Herder Institute (University of Leipzig). It aims at building a common architecture for accessing oral corpus data and developing access methods for these data which are tailored to the needs of specific user groups.
Featured recent publications in English:
- Schmidt, Thomas (2016): Construction and Dissemination of a Corpus of Spoken Interaction - Tools and Workflows in the FOLK project. In: Corpus Linguistic Software Tools, Journal for Language Technology and Computational Linguistics (JLCL 31/1), by Kupietz, Marc & Geyken, Alexander (eds.), pp. 127-154.
- Westpfahl, Swantje / Schmidt, Thomas (2016): FOLK-Gold – A GOLD standard for Part-of-Speech-Tagging of Spoken German. In: Proceedings of the Tenth Conference on International Language Resources and Evaluation (LREC'16), Portorož, Slovenia. Paris: European Language Resources Association (ELRA), pp. 1493-1499.
- Schmidt, Thomas (2016): Good practices in the compilation of FOLK, the Research and Teaching Corpus of Spoken German. In: Compilation, transcription, markup and annotation of spoken corpora, by Kirk, John M. and Gisle Andersen (eds.), Special Issue of the International Journal of Corpus Linguistics [IJCL 21:3], pp. 396-418.
- Ruhi, Şükriye / Haugh, Michael / Schmidt, Thomas / Wörner, Kai (eds.) (2014): Best Practices for Spoken Corpora in Linguistic Research. Newcastle: Cambridge Scholars Publishing.
- Schmidt, Thomas (2014): The Database for Spoken German - DGD2. In: Proceedings of the Ninth International conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland: European Language Resources Association (ELRA), pp. 1451-1457.
- Schmidt, Thomas (2014): The Research and Teaching Corpus of Spoken German - FOLK. In: Proceedings of the Ninth International conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland: European Language Resources Association (ELRA), pp. 383-387.
- Schmidt, Thomas / Wörner, Kai (2014): EXMARaLDA. In: Jacques Durand, Ulrike Gut, and Gjert Kristoffersen (eds.): The Oxford Handbook of Corpus Phonology. Oxford: OUP, pp. 402-419.
- Schmidt, Thomas (2012): EXMARaLDA and the FOLK tools. Two toolsets for transcribing and annotating spoken language. In: Proceedings of the Eighth conference on International Language Resources and Evaluation (LREC-12), Istanbul, Turkey. European Language Resources Association (ELRA), 2012, pp. 236-240.
- Schmidt, Thomas/Wörner, Kai (eds.) (2012): Multilingual Corpora and Multilingual Corpus Analysis. (= Hamburg Studies on Multilingualism 14). Amsterdam: Benjamins, 2012.
- Schmidt, Thomas (2011): A TEI-based approach to standardising spoken language transcription. In: Journal of the Text Encoding Initiative 1/2011.
Project Corpus Technology
The Database of Spoken German (DGD) is being developed as a platform for accessing the oral corpora of the AGD. The DGD enables users in research and teaching to browse and query selected parts of the AGD holdings via the web. [DGD-Flyer]
The transcription editor FOLKER and the annotation tool OrthoNormal were developed for the work on FOLK; both are also made available to the scientific public. [FOLKER-Flyer]
The program area was also responsible for the maintenance, further development and support of EXMARaLDA, a system for compiling, managing and analyzing oral corpora. [EXMARaLDA-Flyer]
External projects
- The BMBF joint project "QUEST: Quality - Established: Erprobung und Anwendung von Kurationskriterien und Qualitätsstandards für audiovisuelle, annotierte Sprachdaten" (testing and application of curation criteria and quality standards for audiovisual, annotated language data) focuses on the potential for reuse and secondary use of audiovisual, annotated language data in the humanities. To this end, QUEST works out quality standards and curation criteria for digital language data and, building on them, develops and tests quality assurance procedures for the creation and curation of such resources.
- The project "Accessing multimodal spoken language corpora: cross-linking and user-group specific differentiation" (ZuMult) is funded by the DFG in its LIS program. In ZuMult, the Program Area "Oral Corpora" at the IDS, the Hamburg Centre for Language Corpora (University of Hamburg) and the Herder Institute (University of Leipzig) cooperate to build a common architecture for accessing oral corpus data and to develop access methods for these data that are tailored to specific user groups.