Wikipedia Corpora

Wikipedia Corpora 2019 (wpd19, wdd19)

Development: IDS and Université Toulouse

Time frame: Wikipedia Dump of 1 August 2019

Extension:

	wpd19 (articles)	wdd19 (article talk)
#Texts	2.323.259	711.935
#Posts	-/-	6.480.350
#Tokens	989.006.303	403.272.910

Please note: wpd19 and wdd19 are currently only available in COSMAS II

Wikipedia Corpora 2017 (wpd17, wdd17, wud17, wrd17)

Development: IDS

Time frame: Wikipedia Dump of 1 July 2017

Extension:

	wpd17 (articles)	wdd17 (article talk)	wud17 (user talk)	wrd17 (redundancy talk)
#Texts	2.065.926	744.857	603.374	240
#Posts	-/-	7.107.696	5.895.545	52.393
#Tokens	873.182.923	349.075.823	309.390.966	1.775.975

Please note: Footnotes are separated and do not appear in the running text

Download: DeReKo Downloads

Wikipedia Corpora 2015 (wpd15, wdd15, wud15)

Development: IDS

Time frame: Wikipedia Dump of April 2015

Extension:

	wpd15 (articles)	wdd15 (article talk)	wud15 (user talk)
#Texts	1.802.682	591.460	539.053
#Posts	-/-	6.200.701	5.523.769
#Tokens	796.638.747	309.897.027	271.441.322

Please note:

Footnotes: In the Wikipedia conversions, the text of each footnote appears in the running text at the position where the footnote reference mark would normally appear. This reflects the Wikitext source. Although these insertions are marked with I5 markup in the corpus representation, the markup is not visible in the COSMAS II result view, where the footnote text appears in mid-text and in some cases in mid-sentence. In such cases the sentence segmentation may also not meet expectations. In future Wikipedia conversions, the footnotes will be separated.

Download: DeReKo Downloads

Foreign Language Wikipedia Corpora, 2015

Development: IDS

Time frame: Wikipedia Dumps of August and September 2015

Extension:

	articles #Tokens	article talk #Tokens	user talk #Tokens
English (wpe15, wde15, wue15)	2.403.943.177	1.270.217.981	2.698.338.998
French (wpf15, wdf15, wuf15)	764.459.026	137.107.729	372.639.260
Hungarian (wpu15, wdu15, wuu15)	117.987.947	8.293.799	26.215.158
Norwegian (wpn15, wdn15, wun15)	99.014.144	5.314.362	32.481.331
Spanish (wps15, wds15, wus15)	578.882.431	54.907.258	276.034.367
Croatian (wpk15, wdk15, wuk15)	46.641.724	2.480.966	18.731.167
Italian (wpi15, wdi15, wui15)	463.022.806	49.825.036	125.573.567
Polish (wpp15, wdp15, wup15)	298.207.197	16.558.557	64.126.136
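The counts on this page use a full stop as thousands separator, and `-/-` where a measure does not apply. The following minimal sketch shows one way to read such a row into integers; it assumes tab-separated columns, and the helper names `parse_count` and `parse_row` are hypothetical.

```python
def parse_count(cell):
    """Convert a count such as '2.480.966' to an int; return None for the '-/-' placeholder."""
    cell = cell.strip()
    if cell == "-/-":
        return None
    return int(cell.replace(".", ""))


def parse_row(line):
    """Split a tab-separated table row into its label and a list of counts."""
    label, *cells = line.split("\t")
    return label.strip(), [parse_count(c) for c in cells]


# Row copied from the table above (columns assumed to be tab-separated)
row = "Croatian (wpk15, wdk15, wuk15)\t46.641.724\t2.480.966\t18.731.167"
label, counts = parse_row(row)

# Total tokens across articles, article talk and user talk
total = sum(c for c in counts if c is not None)
```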

Please note:

The foreign language Wikipedia corpora are not part of the Mannheim German Reference Corpus DeReKo.
Footnotes: In the Wikipedia conversions, the text of each footnote appears in the running text at the position where the footnote reference mark would normally appear. This reflects the Wikitext source. Although these insertions are marked with I5 markup in the corpus representation, the markup is not visible in the COSMAS II result view, where the footnote text appears in mid-text and in some cases in mid-sentence. In such cases the sentence segmentation may also not meet expectations. In future Wikipedia conversions, footnotes will be separated.
Tokenization: The foreign language Wikipedia corpora were converted from the Wikipedia dumps with the same conversion pipeline as the German corpora. As a result, the tokenization applied during the import into COSMAS II is one that was originally developed for German. In particular, this tokenization does not treat the apostrophe (') as a token separator. Consequently, in the French and Italian Wikipedia corpora, proclitic articles, pronouns and other function words that are joined to their base word by an apostrophe form a single token with it (for example, French l'amour, c'est, n'est, m'ennuie). A COSMAS II search for the word form amour will therefore not return the cliticized form l'amour. As a remedy, wildcard operators can be used (e.g. the search query *amour), or the cliticized forms can be listed explicitly in the query.
Likewise, the hyphen is not treated as a token separator. Hence, in the French Wikipedia corpora, all forms with a phonetically induced -t- insertion are represented as one token (a-t-il, a-t-on, va-t-on etc.).
Similarly, the lemmatization in COSMAS II works only for German, so the base form operator '&' is of no use in queries to the foreign language Wikipedia corpora.
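The effect of this tokenization on word-form searches can be sketched as follows. The regular expression is an assumed simplification in which apostrophes and hyphens inside a word do not split it; it is not the actual COSMAS II tokenizer. Python's `fnmatch` patterns stand in for the COSMAS II wildcard operator, and the example sentence is invented.

```python
import re
from fnmatch import fnmatchcase

# Assumed simplification: a token is a run of word characters that may
# contain internal apostrophes or hyphens (neither acts as a separator).
TOKEN = re.compile(r"\w+(?:['’-]\w+)*")


def tokenize(text):
    """Split text into tokens without breaking at apostrophes or hyphens."""
    return TOKEN.findall(text)


def search(tokens, pattern):
    """Return tokens matching a case-insensitive wildcard pattern ('*' = any string)."""
    return [t for t in tokens if fnmatchcase(t.lower(), pattern.lower())]


tokens = tokenize("L'amour est là, mais a-t-il vu l'amour? Un amour fou.")

exact_hits = search(tokens, "amour")      # misses the cliticized forms
wildcard_hits = search(tokens, "*amour")  # also finds l'amour
```

Here `a-t-il` and `l'amour` each come out as a single token, so only the wildcard pattern retrieves the cliticized forms. A leading wildcard would also match unrelated words that happen to end in the same string, which is why enumerating the cliticized forms explicitly can be the more precise option.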

Download: DeReKo Downloads

Wikipedia Corpora, 2013

Development: IDS

Time frame: Wikipedia Dump of July 2013

Extension:

articles (wpd13): 689.046.830 Tokens
article talk (wdd13): 274.141.008 Tokens

Download: DeReKo Downloads

Wikipedia Corpora, 2011

Development: IDS, Projects EuroGr@mm and Corpus Development

Time frame: Wikipedia Dump of 2011

Extension:

articles (wpd11): 560.786.178 Tokens
article talk (wdd11): 234.556.967 Tokens

Download: DeReKo Downloads

Wikipedia Corpora, 2005

Development: IDS

Time frame: Wikipedia Dump of 2005

Extension:

articles (wpd): 50.053.144 Tokens

Download: DeReKo Downloads

References

Noah Bubenhofer, Stefanie Haupt, Horst Schwinn (2011): A Comparable Corpus of the Wikipedia: From Wiki Syntax to POS Tagged XML. Hamburg Working Paper in Multilingualism, 96 B
Eliza Margaretha, Harald Lüngen (2014): Building linguistic corpora from Wikipedia articles and discussions. In: Journal for Language Technology and Computational Linguistics (JLCL) 2/2014
