SegCor – ANR-DFG-Project „Segmentation of Oral Corpora“

Heads of Project:

Thomas Schmidt, IDS Mannheim and Véronique Traverso, ICAR

Research associates (German team):

Arnulf Deppermann, Joachim Gasch, Jan Gorisch, Henrike Helmer, Nadine Proske, Swantje Westpfahl, Ines Rehbein

Research associates (French team 1):

Heike Baldauf-Quilliatre, Biagio Ursi, Carole Etienne, Emilie Jouin-Chardon, Nathalie Rossi-Gensane

Research associates (French team 2):

Flora Badin, François Delafontaine, Iris Eshkol, Layal Kanaan-Caillol, Marie Skrovec

Duration of the project:

March 2016 - February 2019

Project documentation:

Research objectives:

Since the beginning of research on spoken language, a plethora of proposals for its segmentation have been put forward. However, there is as yet no segmentation system that could be used for large corpora of spoken language, i.e. one that is both linguistically substantiated and workable for large-scale corpus segmentation. The lack of theory-based segmentation impedes the use of the corpora for research in language technology and comparative corpus linguistics as well as for analyses of spoken language interaction.

It is the aim of this project to develop methods for the segmentation of spoken language. These methods are to be based on linguistic knowledge and at the same time adequate for the analysis of spoken language on various linguistic levels as well as for the development of tools in computational linguistics. The publication of guidelines for the systematic segmentation of various types of German and French verbal interaction is a milestone of this project. In the second stage, the possibilities of automated segmentation of spoken language corpora based on the segmentation guidelines will be tested and documented. In this way, the project not only improves the usability of the three databases involved but also deepens our knowledge of the structures of spoken language.

The project's research is based on three databases: the Research and Teaching Corpus of Spoken German (FOLK) and the two French databases CLAPI (Corpus de LAngue Parlée en Interaction) and ESLO (Enquêtes sociolinguistiques à Orléans).


SegCor is a project funded by the German Research Foundation (DFG) and the French National Research Agency (ANR). The project is a cooperation between the Department of Pragmatics of the Institute for the German Language (IDS Mannheim) and two French partners: ICAR (Interactions, Corpus, Apprentissages, Représentations) of the University of Lyon and LLL (Laboratoire Ligérien de Linguistique) of the University of Orléans.


A preliminary version of the annotation and segmentation guidelines from a syntactic perspective, i.e., according to the topological field model, is available here: PDF

Westpfahl, Swantje; Gorisch, Jan (2018): A Syntax-Based Scheme for the Annotation and Segmentation of German Spoken Language Interactions. In: Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018), pp. 109-120. Workshop at COLING 2018. Santa Fe, New Mexico, 25.-26.08.2018. PDF

Schmidt, Thomas; Westpfahl, Swantje (2018): A Study on Gaps and Syntactic Boundaries in Spoken Interaction. In: Proceedings of KONVENS 2018. Vienna, Austria, 19.-21.09.2018. PDF