French-German Colloquium WikiCorp 2018

Fostering linguistic studies on Wikipedia discussions

Multilingual corpus building, annotation and exploration tools

Two-day colloquium at Université Nice Côte d'Azur (FR)
July 9-10, 2018

Goals of the colloquium

The colloquium is committed to the long-term goal of building comparable French-German discussion corpora as a special type of big CMC corpora using TEI-compliant standards. These shall serve as a basis to further develop common tools and methods for the cross-lingual, corpus-based analyses of interaction, politeness and conflict.

  1. Objectives concerning corpus building, standards, and tools:
    Harmonize the parameters of the hitherto separate French and German Wikipedia corpus building processes in order to make them interoperable for German-French contrastive and cross-lingual analyses: further develop the standards of the TEI CMC SIG; align metadata categories and value taxonomies.
  2. Objectives concerning interaction analyses:
    Develop annotation categories for interaction patterns, politeness cues, and conflict analysis; work towards a joint representation of conflict structures.
  3. Objectives concerning corpus analysis methods:
    Develop and adapt corpus-linguistic methods from KorAP and Textométrie to explore and visualize cross-lingual analyses on Wikipedia discussion corpora; prepare the exploration of cross-linguistic distributional semantics by training word embedding models on the French and German Wikipedias.
Invited speakers

David Laniado, Eurecat, Barcelona
Torsten Zesch, Universität Duisburg-Essen


Céline Poudat, Université Nice Côte d'Azur
Angelika Storrer, Universität Mannheim
Harald Lüngen, Institut für Deutsche Sprache, Mannheim
Laura Herzberg, Universität Mannheim

Location: Université Nice Côte d'Azur, Campus Saint-Jean-d’Angely 3, MSHS building, Salle Plate. The easiest way to get there is by tram: get off at the stop Saint-Jean d’Angely Université. The MSHS building (the one with a clock) is directly in front of the tram stop.

Local organisation: Céline Poudat and BCL team in Nice

Funding: Huma-Num CORLI consortium

Confirmed Participants (last updated 2018-06-28)
Elena Cabrio, Wimmics, Université Côte d’Azur
Natalia Grabar, STL, Université Lille 3
Laura Herzberg, Universität Mannheim
Mai Ho-Dac, CLLE-ERSS, Université Toulouse
Marc Kupietz, Institut für Deutsche Sprache, Mannheim
David Laniado, Eurecat, Barcelona
Harald Lüngen, Institut für Deutsche Sprache, Mannheim
Christophe Parisse, Head of Ortolang, MoDyCO, Université Paris X-Nanterre
Céline Poudat, BCL, Université Côte d’Azur
Angelika Storrer, Universität Mannheim
Serena Villata, Wimmics, Université Côte d’Azur
Laurent Vanni, BCL, Université Côte d’Azur
Torsten Zesch, Universität Duisburg-Essen

Follow-up activity

A post-conference publication on Wikipedia corpus building, annotation and exploration is planned, either as a book publication or as a special issue of a journal such as Corpus.


To register, please fill in the registration form and send it by 2 July 2018 via email to

PRELIMINARY SCHEDULE (last updated 2018-05-29)
Monday, 9 July 2018
9:30-10:00 Opening
10:00-12:00 Section I: Joint corpus building, standards, and tools

► Christophe Parisse: Sharing corpora in repositories: using the TEI as an exchange format across various types of language data
► Mai Ho-Dac: The WikiDisc Corpus: Available metadata and interactional features
► Harald Lüngen: Formats and Features of the IDS Wikipedia Corpora
► Natalia Grabar: Building comparable corpora from the French Wikipedia and alignment of parallel sentences
12:00-13:30 Lunch
13:30-16:00 Section II: Linguistic analyses of social interaction and conflicts

Invited Talk
► Torsten Zesch: Annotating, Detecting, and Understanding Stance in Computer-Mediated Debates

► Laura Herzberg: Analysing social interaction in Wikipedia discussions
► Céline Poudat: Linguistic annotation of disagreement and conflict in Wikipedia discussions
► Elena Cabrio and Serena Villata: Argument mining on the Web

Discussion and documentation of desiderata and requirements of Sections I and II
16:00-16:30 Coffee
16:30-17:00 Breakout Session - ad Section I (a)
- Alignment of components and metadata of the French and German WP corpora
- TEI representation issues of Wikipedia talk
17:00-17:30 Breakout Session - ad Section II
- Representing and annotating the structure of CMC interaction on WP talk pages
- Annotation layers and categories of interaction patterns
- Automated detection and annotation of stance and conflict
20:00 Dinner
Tuesday, 10 July 2018
9:30-10:00 Documentation of the Results of the two Breakout Sessions from Day 1
10:00-12:00 Section III: Corpus analysis methods

Invited Talk
► David Laniado: Visualization of social interactions in Wikipedia

► Céline Poudat and Laurent Vanni: Looking for characteristic patterns using deep learning methods with Hyperbase Web
► Marc Kupietz: Current developments for corpus query, analysis and visualisation at IDS

Discussion and documentation of desiderata and requirements
12:00-14:00 Lunch
14:00-14:30 Breakout Session - ad Section III
14:30-15:00 Breakout Session - ad Section I (b)
- Impulse presentation by Marc Kupietz: Current initiatives on comparable corpora: EuReCo, ICC
- Discussion of methods and resources for comparable corpus building
15:00-15:30 Documentation of the results of the two Breakout Sessions
15:30-16:00 Coffee
16:00-17:30 ► Planning the post-conference publication.
► Planning the implementation of results, follow-up activities, projects, and further co-operation
► Wrap-up of the colloquium


Wikipedia is one of the most successful projects of the Web 2.0. Since its launch in 2001, thousands of contributors have built this huge knowledge resource, which is not only used as an online encyclopedia, but also as an object of research in many academic disciplines. It also constitutes a rich and unique resource for linguistic studies, first of all because of its multilinguality, and secondly because of its huge discussion spaces, in which the collaborative writing effort is negotiated. These so-called talk pages can be used as big corpus resources of Computer-Mediated Communication (CMC).

The French and German participants of the colloquium are part of an initiative which aims to foster linguistic studies on Wikipedia by providing recommendations for building standardized Wikipedia corpora, methods for their linguistic processing and exploration, and descriptors and annotations for the analysis of talk pages. The French-German team of proposers started co-operating in 2016 with a first workshop in Mannheim entitled “Wikipedia: Discourse and corpus linguistic perspectives”. Since then, the proposers and other participants have co-operated in various constellations at conferences and on joint publications and proposals. The group is now ready to prepare the ground for jointly building comparable French-German corpora to be used in cross-lingual, corpus-based analyses of Wikipedia discussions.

State of corpus technology and corpus-based analyses of Wikipedia discussions in the French-German group

Up to now, most linguistic studies on Wikipedia have focused on the article pages and have not analysed in depth the linguistic features used in the discussion spaces. This may be due to three reasons: (i) Wikipedia is quite a complex object that linguists find difficult to handle; (ii) Wikipedia interactions need specific descriptors and ad hoc annotations for analysis; and (iii) existing corpus technologies and exploration tools need to be adjusted to the specificities of CMC corpora in general and Wikipedia corpora in particular. More sophisticated tools and methods for linguistic annotation and corpus exploration are needed to better exploit the huge and valuable corpus resources that can be constructed from Wikipedia discussions.

The colloquium will bring together researchers who have solid experience in preparing monolingual (French and German) corpora from Wikipedia, in disseminating them and providing corpus technology for their analysis, and in conducting linguistic research on social interaction in Wikipedia discussions, with a particular interest in the analysis and detection of conflicts. Their previous work on Wikipedia includes studies on conflict annotation and conflict detection (e.g. Poudat et al. 2016, Poudat et al. 2017, Ho-Dac et al. 2017, Poudat & Ho-Dac (to appear)), studies on writing style and language variation (Storrer 2013), methodological issues related to Wikipedia discourse analyses (Gredel 2017), cross-lingual and cross-mode studies on interaction signs (Herzberg & Storrer 2017), and work on multimodal aspects of Wikipedia (Wessler et al. 2017).

The French and German participants are involved in national corpus initiatives – e.g. CLARIN-D (Common Language Resources and Technology Infrastructure) in Germany and the CORLI consortium (Huma-Num national consortium for the study of Language, Corpora and Interactions) in France. They have a strong common interest in developing and using standards for the annotation of Wikipedia corpora to be used in linguistic research projects. They have contributed to the "Computer-Mediated Communication" Special Interest Group (SIG) of the Text Encoding Initiative (TEI) (cf. e.g. Margaretha & Lüngen 2014, Chanier et al. 2014, Beißwenger et al. 2017), and have presented papers and exchanged ideas at the CMC corpora conferences in Rennes (2015) and Ljubljana (2016).

The partner IDS is committed to the idea of a European Reference Corpus (EuReCo): joining the forces of national and reference corpora initiatives to build and exploit multilingual comparable corpora from existing monolingual resources, which remain physically located at their hosting institutions and are combined virtually (IDS is currently co-operating with the Hungarian and Romanian national corpora, cf. Kupietz et al. 2017). BCL and CLLE, hand in hand with the French national consortium CORLI (Corpus, Languages and Interactions), could be the relevant French partners for the EuReCo initiative, with their Wikipedia corpora as a starting point. The colloquium shall be used to establish such a co-operation.

Moreover, both German and French participants are interested in corpus exploration methods. The colloquium aims to foster the exchange between researchers working with textometric methods, which are particularly well developed in France under the name Textométrie or Statistiques textuelles (cf. Poudat & Landragin 2017), and researchers developing methods for corpus analysis at the IDS, with its 50-year-long tradition in corpus-based language research (Lüngen & Kupietz 2017, Fankhauser & Kupietz 2017). Both sides are keen to adapt their corpus frameworks to the specific features of digital genres and CMC.


Beißwenger, M., Chanier, T., Erjavec, T., Fišer, D., Herold, A., Ljubešić, N., Lüngen, H., Poudat, C., Stemle, E., Storrer, A. & Wigham, C. (2017). Closing a Gap in the Language Resources Landscape: Groundwork and Best Practices from Projects on Computer-mediated Communication in four European Countries. In: Selected Papers from the CLARIN Annual Conference 2016, October 26–28, 2016, Aix-en-Provence, France. Linköping University Electronic Conference Proceedings. pp. 1-18.

Borra, E., Weltevrede, E., Ciuccarelli, P., Kaltenbrunner, A., Laniado, D., Magni, G. & Venturini, T. (2015). Societal Controversies in Wikipedia Articles. In: Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (pp. 193–196). New York, NY, USA: ACM.

Borra, E., Weltevrede, E., Ciuccarelli, P., Kaltenbrunner, A., Laniado, D., Magni, G., & Venturini, T. (2014). Contropedia - the Analysis and Visualization of Controversies in Wikipedia Articles. In: Proceedings of The International Symposium on Open Collaboration (pp. 34:1–34:1). New York, NY, USA: ACM.

Chanier, T., Poudat, C., Sagot, B., Antoniadis, G., Wigham, C., Hriba, L., Longhi, J. & Seddah, D. (2014). The CoMeRe corpus for French: structuring and annotating heterogeneous CMC genres. In: Journal for Language Technology and Computational Linguistics (JLCL), 29(2), pp. 1–30.

Fankhauser, P. & Kupietz, M. (2017). Visualizing Language Change in a Corpus of Contemporary German. In: Proceedings of the 9th International Corpus Linguistics Conference, University of Birmingham.

Ferschke, O., Zesch, T. & Gurevych, I. (2011). Wikipedia Revision Toolkit: Efficiently Accessing Wikipedia's Edit History. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. System Demonstrations.

Gredel, E. (2017). Digital discourse analysis and Wikipedia: Bridging the gap between Foucauldian discourse analysis and digital conversation analysis. In: Journal of Pragmatics (115), pp. 99-114. [Special Issue on Microanalysis of Online Data]

Herzberg, L. & Storrer, A. (2017). Investigating interaction signs across genres, modes and languages: The example of OKAY. In: Proceedings of the 5th Conference on CMC and Social Media Corpora for the Humanities (cmccorpora17), Bozen/Bolzano, pp. 16-20. (doi: 10.5281/zenodo.1040875)

Ho-Dac, L-M, Laippala, V., Poudat, C. & Tanguy, L. (2017). Exploring Wikipedia Talk Pages for Conflict Detection. In: Darja Fišer & Michael Beißwenger (ed): Investigating Computer-Mediated Communication: Corpus-Based Approaches to Language in the Digital World. Ljubljana, pp. 146-168.

Horsmann, T., Beißwenger, M. & Zesch, T. (2017). Reliable Part-of-Speech Tagging of Low-frequency Phenomena in the Social Media Domain. In: Proceedings of the Conference on CMC and Social Media Corpora for the Humanities.

Horsmann, T., Beißwenger, M. & Zesch, T. (2017). Part-of-speech tagging for corpora of computer-mediated communication: A case study on finding rare phenomena. Chapter in: Investigating Computer-mediated Communication: Corpus-based approaches to language in the digital world.

Horsmann, T. & Zesch, T. (2016): FlexTag: A Highly Flexible Pos Tagging Framework. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), European Language Resources Association (ELRA).

Kupietz, M., Witt, A., Bański, P., Tufiş, D., Cristea, D. & Váradi, T. (2017). EuReCo – Joining Forces for a European Reference Corpus as a sustainable base for cross-linguistic research. In: Bański, Piotr/Kupietz, Marc/Lüngen, Harald/Rayson, Paul/Biber, Hanno/Breiteneder, Evelyn/Clematide, Simon/Mariani, John/Stevenson, Mark/Sick, Theresa (eds.): Proceedings of the Workshop CMLC-5+BigNLP 2017. Birmingham, 24 July 2017. Mannheim: Institut für Deutsche Sprache, 2017. pp. 15-19.

Laniado, D., & Tasso, R. (2011). Co-authorship 2.0: Patterns of Collaboration in Wikipedia. In: Proceedings of the 22nd ACM Conference on Hypertext and Hypermedia (pp. 201–210). New York, NY, USA: ACM.

Laniado, D., Tasso, R., Volkovich, Y., & Kaltenbrunner, A. (2011). When the Wikipedians Talk: Network and Tree Structure of Wikipedia Discussion Pages. In: Fifth International AAAI Conference on Weblogs and Social Media. Retrieved from

Lüngen, H. & Kupietz, M. (2017). CMC Corpora in DeReKo. In: Bański, Piotr/Kupietz, Marc/Lüngen, Harald/Rayson, Paul/Biber, Hanno/Breiteneder, Evelyn/Clematide, Simon/Mariani, John/Stevenson, Mark/Sick, Theresa (eds.): Proceedings of the Workshop CMLC-5+BigNLP 2017. Birmingham, 24 July 2017. Mannheim: Institut für Deutsche Sprache, 2017. pp. 20-24.

Margaretha, E. & Lüngen, H. (2014). Building linguistic corpora from Wikipedia articles and discussions. In: Journal for Language Technology and Computational Linguistics (JLCL), 29(2), pp. 59–82.

Poudat, C. & Ho-Dac, M. (To appear). Désaccords et conflits dans le Wikipédia francophone. In: Proceedings of CERLICO 2017.

Poudat, C., Vanni, L. & Grabar, N. (2016). How to explore conflicts in Wikipedia talk pages? In Actes des JADT 2016, 7-10 juin 2016, Nice, vol. 2, pp. 645–656.

Poudat, C., Grabar, N., Paloque-Berges, C., Chanier, T. & Kun, J. (2017). Wikiconflits : un corpus de discussions éditoriales conflictuelles du Wikipédia francophone. In: Ciara R. Wigham & Gudrun Ledegen (eds.): Corpus de communication médiée par les réseaux : construction, structuration, analyse. L'Harmattan. ISBN 978-2-343-11212-1. 〈hal-01485427〉

Storrer, A. (2013). Sprachstil und Sprachvariation in sozialen Netzwerken. In B. Frank-Job, A. Mehler, & T. Sutter (Eds.), Die Dynamik sozialer und sprachlicher Netzwerke. Konzepte, Methoden und empirische Untersuchungen an Beispielen des WWW, Wiesbaden: VS Verlag für Sozialwissenschaften, pp. 331–366.

Wessler, H., Theil, C. K., Stuckenschmidt, H., Storrer, A. & Debus, M. (2017). Wikiganda: Detecting Bias in Multimodal Wikipedia Entries. In: Seizov, Ognyan & Janina Wildfeuer (eds.): New Studies in Multimodality. Bloomsbury. pp. 201-224.

Wojatzki, M. & Zesch, T. (2017). Neural, Non-neural and Hybrid Stance Detection in Tweets on Catalan Independence. In: Papers of 2nd SEPLN Workshop on Evaluation of Human Language Technologies for Iberian Languages (IBEREVAL), volume 2, 2017.

Wojatzki, M. & Zesch, T. (2016). Stance-based Argument Mining – Modeling Implicit Argumentation Using Stance. In: Proceedings of the KONVENS, 2016.

Zesch, T. (2012). Measuring Contextual Fitness Using Error Contexts Extracted from the Wikipedia Revision History. In: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2012).