DEREKO I

(Project completed March 2002)

The collaborative project for constructing and developing the German Reference Corpus (DeReKo I), funded by the State of Baden-Württemberg, started in May 1999 and ended in March 2002. The collaboration partners of the IDS were the Institute for Natural Language Processing (IMS) at the University of Stuttgart and the Linguistics Department (SfS) at the University of Tübingen; their project term ended in January 2002. The objective was to represent the linguistic reality of present-day German (from 1956 to the end of 2001) broadly and appropriately, and to process these data with modern corpus-linguistic methods in order to make them available for research. The project was part of the overall project of the task group for corpus technology at the IDS. It followed directly on the corpus acquisition work of the preceding years and made use of resources within the IDS.

The IDS undertook the task of acquiring, documenting and converting the texts, and of electronically encoding their document structures. Bibliographic information was added using the Corpus Encoding Standard (CES). The University of Tübingen created tools for the morphosyntactic annotation of the texts, and the University of Stuttgart worked on the development of research tools.

The assignment of the IDS was to acquire and process up to one billion words of text material covering the widest possible range of text types. The text material acquired in electronic format comprises a total of twelve regional and trans-regional newspapers from Germany, Austria and Switzerland, each covering several volumes, as well as various journals, over 250 literary fiction titles and non-fiction books, and a collection of texts on politics, law and science. Some of the materials were released for use during the project, once they had been converted, edited and documented as a corpus. Altogether, corpora with a total volume of about 993 million text words were created.

The last phase of the project was dominated by the search for legal solutions to copyright issues. These issues make it increasingly difficult to permit unrestricted online use of electronic text corpora. Although good results were achieved, restrictive copyright management and high financial claims make it almost impossible to keep the data constantly updated.

The project has shown that corpus acquisition, corpus annotation and corpus documentation have to be considered ongoing tasks. Only then can changes in the lexicon and grammar of the German language be continuously recorded and explored.

In addition, the project has shown that the communicative, legal, text-linguistic, technological and bibliographic work involved in constructing a reference corpus, as described above, entails high personnel costs.

Heads of Project:

Cyril Belica <belica@ids-...>

Ulrike Haß-Zumkehr <zumkehr@ids-...>

Scientific Staff Members:

Brigitte O. Endres

Scientific Assistants:

Christian Weiß <weiss@ids-...>

Student Assistants:

Mirko Ganz

                               
