Just published

Edited volume "Best Practices for Spoken Corpora in Linguistic Research", edited by Şükriye Ruhi, Michael Haugh, Thomas Schmidt and Kai Wörner, and a paper by Swantje Westpfahl: "STTS 2.0? Improving the Tagset for the Part-of-Speech-Tagging of German Spoken Data"
Şükriye Ruhi, Michael Haugh, Thomas Schmidt, Kai Wörner (eds.) (2014): Best Practices for Spoken Corpora in Linguistic Research. Newcastle: Cambridge Scholars Publishing.
The volume goes back to a workshop of the same name held at LREC 2012. It contains fourteen contributions that address good practice in working with spoken corpora.

A more detailed description is available at the following URL:
http://www.cambridgescholars.com/best-practices-for-spoken-corpora-in-linguistic-research
Westpfahl, Swantje (2014): STTS 2.0? Improving the Tagset for the Part-of-Speech-Tagging of German Spoken Data. In: Lori Levin and Manfred Stede (eds.): Proceedings of LAW VIII - The 8th Linguistic Annotation Workshop. Dublin, Ireland: Association for Computational Linguistics and Dublin City University, pp. 1–10. Available online at http://www.aclweb.org/anthology/W14-4901
Part-of-speech tagging (POS-tagging) of spoken data requires different means of annotation than POS-tagging of written and edited texts. In order to capture the features of German spoken language, a distinct tagset is needed to respond to the kinds of elements which only occur in speech. In order to create such a coherent tagset the most prominent phenomena of spoken language need to be analyzed, especially with respect to how they differ from written language. First evaluations have shown that the most prominent cause (over 50%) of errors in the existing automatized POS-tagging of transcripts of spoken German with the Stuttgart Tübingen Tagset (STTS) and the treetagger was the inaccurate interpretation of speech particles. One reason for this is that this class of words is virtually absent from the current STTS. This paper proposes a recategorization of the STTS in the field of speech particles based on distributional factors rather than semantics. The ultimate aim is to create a comprehensive reference corpus of spoken German data for the global research community. It is imperative that all phenomena are reliably recorded in future part-of-speech tag labels.