Corpus Search, Management and Analysis System

Cyril Belica: The Overall Corpus Linguistic Concept of the COSMAS Platform

1. Scientific Methodological Premises, Principles and Approaches of Empirical Anchoring of Corpus-based Linguistic Investigations

  • Principle of minimal assumption (1991)

    • minimal assumption -- J. Sinclair
    • Low-hypothesis corpus exploration
    • A posteriori linguistic interpretation
    • Methodological and technical consequences
      • Language independence
      • Managing annotations of discontinuous text areas
      • Simultaneous management of any number of, even competing, annotation layers
      • ambiguous, parameterisable tokenisation (1993)
      • corpus-appropriate, norm-independent lemmatisation (1994)
  • Principle of very large corpora

    • more data is better data -- R. Mercer / K. Church
    • as an indispensable empirical basis for the observation of language usage
    • 28 million running words in 1991
    • over 2 billion running words in 2005
    • technical consequences
      • several adaptive indexing methods (1997)
      • incremental indexing (1993)
      • hardware and software parallelization (1998)
      • results cache (1999)
      • optimization of proximity logic (1997)
  • Principle of copyright clearance

    • empirical text material with fully secured copyright status

  • Principle of virtual corpora (1991)

    • Representativeness
      • during corpus acquisition, the aim is not representativeness but
        • stratification
        • quantity
        • copyright clearance
        • extratextual documentation
      • user-defined representativeness is achieved in the corpus use phase, through the user's dynamic composition of virtual corpora
    • Composition of virtual corpora based on
      • text-external criteria (1992)
      • text-internal criteria (1992)
      • distributional properties
    • Monitor Corpora (1993)
    • Technical consequences
      • predefined corpora (1991)
      • user-definable corpora (1992)
        • save (1992)
        • load (1992)
        • naming (1992)
  • Sampling principle (1992)

    • reproducibility of the results
    • extrapolability of the results
    • technical consequences
      • random selection of corpus texts (1992)
      • random selection of matches (1993)
  • Analysis paradigm instead of consultation paradigm (1994)

    • Identifying recurrent constituents of language use from empirical language data
    • lexical, syntactic and semantic analysis not separated from each other
    • investigation of probabilistic, preference-relational structures
    • use of mathematical-statistical, pattern-oriented, inductive and data-driven methods
      • co-occurrence analysis and clustering (1995)
      • neologism recognition (1996)
      • contrastive studies (1996) ("omnis determinatio est negatio")
        • methodological and technical implications
          • several virtual corpora activated at the same time
      • multidimensional analysis (1998)
      • autofocusing the context of analysis (1999)
      • identification of syntagmatic patterns (2000)
      • analysis of co-occurrence profiles (2001)
      • release of the web version of the CCDB co-occurrence database (2001)
      • analysis of usage aspects (2003)
      • modelling of semantic proximity (2004)
      • hierarchical and topological clustering of co-occurrence profiles (2005)
      • contrasting of near-synonyms (2006)
  • Abstract text model (1993)

    • technical consequences
      • SGML-based (1993)
      • independence of indexing from the external text model (DTD)
      • handling discontinuous text areas
      • annotations are kept separate from the text and projected onto it
      • processing morphosyntactic annotations (1997)
      • text-model-sensitive presentation of annotations (see multimedia suitability)
  • Principle of multilingualism (1995)

    • language independence
    • exchangeable language-specific modules
  • IT implementation consistently aligned with the scientific and methodological principles above

    • Modularity
      • COSMAS Core (1992)
      • language-independent modules (1992)
      • language-specific modules (1992)
      • library of API services for other application programs (1994)
      • batch processing (1996)
    • Client-server concept
      • network capability (1993)
      • web connectivity (1996)
    • line-oriented search query language (1991)
      • logical operators
      • hit-including and hit-excluding distance operators
      • maximum and interval distance
      • lower / upper case
      • lemmatization
      • annotations
      • previous queries
      • previous search results
      • KWIC filtering
      • word form and lemma lists
      • expansion of the search objects
    • graphical drag-and-drop search query language (1994)
      • syntax-sensitive synoptic nesting of partial search queries
      • search query palette
      • search query macros
    • variable proximity metric (1991)
      • implicit
        • word-segment, word, sentence, paragraph and text metrics
      • explicit
        • SGML annotations
        • model of time and time metric for audio annotations
    • presentation of results in stages: text overview, concordance (KWIC), full-text citation (1991)
    • miscellaneous
      • unified header and text search (1994)
      • results stack (1992)
      • import of word lists and search queries (1993)
      • various export options (1992)
      • graphical representation of chronological information (2000)
      • display of source references with page numbers (1992)
    • multimedia suitability
      • interface to the external SGML/XML viewer
      • interface to the multimedia player
    • organisation of data
      • any number of separate corpus archives (1994)
    • administration
      • user registration (2002)
      • user management (1996)
      • corpus management (1992)
      • administration of access rights (1993)
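
Illustration (not part of COSMAS): the distance operators, the word metric and the KWIC stage listed above can be pictured with a minimal Python sketch. It assumes a naive regex tokenizer and a pure word metric; the function names tokenize, within_distance and kwic are hypothetical and do not reflect the actual COSMAS query language or API.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens (simplifying assumption)."""
    return re.findall(r"\w+", text.lower())


def within_distance(tokens, a, b, max_dist):
    """Return (i, j) index pairs where token a (at i) and token b (at j)
    are at most max_dist words apart, in either order (word metric)."""
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [j for j, t in enumerate(tokens) if t == b]
    return [(i, j) for i in pos_a for j in pos_b
            if i != j and abs(i - j) <= max_dist]


def kwic(tokens, hit, width=3):
    """Render one hit as a keyword-in-context line; the bracketed span
    runs from the first to the last matched token (hit-including)."""
    start, end = sorted(hit)
    left = " ".join(tokens[max(0, start - width):start])
    middle = " ".join(tokens[start:end + 1])
    right = " ".join(tokens[end + 1:end + 1 + width])
    return f"{left} [{middle}] {right}".strip()


if __name__ == "__main__":
    sample = ("The corpus grows every year and the large corpus "
              "supports usage research")
    toks = tokenize(sample)
    for hit in within_distance(toks, "large", "corpus", max_dist=2):
        print(kwic(toks, hit, width=2))
```

A sentence or paragraph metric could be sketched in the same way by counting sentence or paragraph boundaries between the two positions instead of words.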

2. COSMAS-I Publications, Reports, Lectures, Workshops, Presentations and Other Activities

Note: A successful outline of the defining contours of the overall corpus-linguistic concept of the COSMAS platform of the 1990s is also given, as an introduction to the author's work on the linguistic classification of usual word combinations, in:

Steyer, Kathrin (2004): Kookkurrenz. Korpusmethodik, linguistisches Modell, lexikografische Perspektiven. In: Steyer, Kathrin (ed.): Wortverbindungen - mehr oder weniger fest. Berlin/New York. (= Jahrbücher des Instituts für Deutsche Sprache, 2003), pp. 87-116.

