Workshop "Corpus linguistics 2040: Which data, which methods, which models?"

Organisers:
Andreas Witt, Leibniz-Institut für Deutsche Sprache (IDS), Mannheim
Christian Mair, Englisches Seminar, Universität Freiburg i. Br.
Date:
10 – 11 July 2025
Venue:
Leibniz-Institut für Deutsche Sprache (IDS), Mannheim, Germany
Contact:
futurecorp@ids-mannheim.de
Key dates:
1 Dec. 2024 – 16 Feb. 2025:   Submission of abstracts
1 March 2025:                 Notification of acceptance
March – June 2025:            Registration
10 – 11 July 2025:            Workshop

This two-day event is designed as a scoping workshop on the future of corpus linguistics, focussing on empirical, methodological and conceptual issues facing this research community today. Although the two organising institutions focus on English and German, corpus linguists working on other languages are explicitly invited to attend. We are convinced that debating these issues across disciplinary and language boundaries will be mutually beneficial.

The study of language structure, variation and change with digital corpora has moved from the margins to the centre of linguistics over the past five decades, inspiring and promoting usage-based models within linguistics and making (corpus) linguistics relevant and attractive in the wider domain of the Digital Humanities.

In spite of this overall success, progress has been uneven in places. Most importantly, while there are excellent corpus-linguistic working environments for several world languages, the vast majority of languages remain under-resourced to this day. Even for the well-resourced languages, coverage of genres and varieties is still uneven. Especially for spontaneous conversation, an important baseline for research on language structure and use, there is considerable scope for improvement in corpus size, access to the data (fully multi-modal vs. audio vs. orthographic transcription), and range of search options.

Recently, corpus-linguistic routines have been disrupted by advances in AI-based text generation and machine translation. Some of the challenges they pose are practical and immediate, such as the question of how future corpora should handle press and media data that are partly or fully machine-generated. Others are conceptual. Today, large reference corpora of pluricentric languages such as English, German and Spanish commonly use national standard varieties as a major ordering principle. By 2040, however, the widespread use of AI-based language technologies in everyday communication may make national boundaries less important; algorithms may also partly take over from educated elites as the chief agents of linguistic standardisation. Whatever future is envisaged for corpus linguistics, one thing remains clear: more numerous, more diverse and more complex corpora will also require more attention to issues of sustainable infrastructure for data preservation and enrichment.


The workshop will feature four sessions dedicated to these key topics:

  • Corpora of spontaneous speech – new formats, new searches
  • Corpora and/with/against AI/LLMs
  • Multilingual and multimodal corpora
  • Future infrastructures for corpus linguistics and the digital humanities

For a further four sessions, we invite posters, software demonstrations and relevant case studies covering a wide range of object languages.


Bursaries: Funding permitting, we will be able to provide a limited number of travel grants, awarded on a competitive basis after the abstract submission deadline, to applicants who are early-career researchers, are employed on part-time or short-term contracts, or face similar challenges.