Methods of Corpus Analysis and Corpus Classification

Topic / Domain Classification of Corpora

Until recently, our former staff member Christian Weiß was responsible for the work on this subproject. These activities are currently on hold. If you have any questions concerning this area, please send an email to: korpuslinguistik@ids-mannheim.de

The aim of this subproject is, on the one hand, the topic / domain classification of corpora; on the other, the construction of thematic virtual sub-corpora and the disambiguation of, for example, readings by analysing domain-specific frequency distributions. The starting point is the creation of a taxonomy of subject-area topics. This is accomplished in a semi-automatic process that combines text mining (document clustering) with the manual mapping of clusters to an external ontology. The taxonomy thus acquired is suitable for both manual and automatic classification. For automatic classification, a naive Bayes text classifier is developed and evaluated on a classified corpus of almost two billion words.

A detailed description of the subproject is available as a PDF document (228 KB).

For results see:

  • A Catalogue of Topics: A formal and externally anchored ontology of topics
  • A Method of Clustering: A mathematical-statistical method for automatically finding topics and text instances
  • A Method of Text Classification: A mathematical-statistical method for the thematic classification of (so far unannotated) texts
  • Further mathematical-statistical Methods such as keyword extraction or text filtering

In addition, A Survey on Existing Classification Patterns of Other Language Corpora is provided.

The Catalogue of Topics

A subgoal of the project was the creation of a topic catalogue with an external topic description that is as impartial as possible. To keep restrictions on the topics to be classified to a minimum (for example, a restriction to scientific subjects, as in a scientific library), the classification was aligned with as comprehensive an upper ontology as possible. The Open Directory is probably the largest existing ontology of this kind.

Due to its aspiration to capture and name all subject areas, the Open Directory provides a large pool of possible topics and topic descriptions. Several points, however, speak against a direct adoption of this classification scheme:

  • On the one hand, only a fraction of the categories have proven to be of interest. While the category “Culture: Film” is worth adopting, the category “Culture: Film: Film Distribution” is not.
  • A second argument against a one-to-one adoption was “hidden categories”, i.e. thematically very similar topics filed under very different top-level categories. Gardening topics, for example, appear under several distinct categories:
      • “At Home: Garden and Plants”
      • “Economy: Construction: Gardening and Landscaping”
      • “Science: Natural Sciences: Biology: Botany: Botanical Gardens”
      • “Economy: Consumer Products: House and Garden”
  • Alongside excessive granularity, the opposite tendency, i.e. too coarse a grid, could also be observed, for example with literary or religious texts.

The Clustering Method

Clustering is a subform of data mining, more specifically of unsupervised machine learning, and is mainly applied in exploratory data analysis in a variety of disciplines such as biology, the empirical social sciences, or information retrieval. Document clustering means the automatic grouping of texts with similar content.
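As an illustration of the general idea (not of the CLUTO algorithm used in the project), here is a minimal pure-Python sketch: documents are represented as bag-of-words count vectors and grouped by a greedy single-pass rule that assigns a document to the first cluster whose seed document exceeds a cosine-similarity threshold. All function names, the toy documents, and the threshold value are illustrative assumptions.

```python
from collections import Counter
from math import sqrt

def bow(text):
    """Bag-of-words vector as a word -> count mapping."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cluster(docs, threshold=0.3):
    """Greedy single-pass clustering: join the first cluster whose seed
    document is similar enough, otherwise start a new cluster."""
    clusters = []  # each cluster is a list of (index, vector) pairs
    for i, d in enumerate(docs):
        v = bow(d)
        for c in clusters:
            if cosine(v, c[0][1]) >= threshold:
                c.append((i, v))
                break
        else:
            clusters.append([(i, v)])
    return [[i for i, _ in c] for c in clusters]

docs = [
    "the match ended with a late goal in extra time",
    "the striker scored a goal and won the match",
    "the central bank raised interest rates again",
    "interest rates and inflation worry the bank",
]
print(cluster(docs))  # -> [[0, 1], [2, 3]]
```

The sports texts end up in one cluster and the economy texts in another, purely on the basis of shared vocabulary; real document clusterers such as CLUTO use far more refined weighting and criterion functions.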

This can be demonstrated with the aid of the chart on the left: it shows a hierarchical clustering of categories from the newspaper domain (culture, sports, economy), determined with the clusterer CLUTO; the colouring of the fields reflects the prominence of a keyword. Moreover, the colour contrast allows conclusions about the thematic specificity of a cluster.

The chart on the right shows an even finer division of newspaper data. A cluster analysis was carried out on the 1998 volume of the “Frankfurter Rundschau”, visualising the seasonal course of a number of selected topics. In this way, for example, an increase in texts about books can be observed, attributable to the Frankfurt Book Fair. The topic “football” is especially strongly represented in the summer, which can be attributed to the Football World Cup in France. Clustering thus provides the set of topics under public discussion, or written about in newspapers, at a given point in time, offering an opportunity to make current affairs transparent.

For this subproject, a clustering method was chosen that facilitated the assignment of texts to most of the topics defined in the ontology.

With the help of the above-mentioned clusterer CLUTO, all texts of the IDS corpora were clustered and divided into about 1,500 thematic clusters.

Two manual steps were taken, following the fully automatic clustering:

The first step was quality control: clusters that did not show the desired thematic homogeneity were excluded. For rare topics such as “equestrian sports”, clusters were examined completely, i.e. document by document.

The second step was annotation: each cluster was labelled either ad hoc according to its content or according to the above-mentioned topic taxonomy.

A cluster of texts from 1985 about the disease AIDS, for example, was thus marked with the label “aids 85” and the topic area “health_diet: health”.

You can find an overview of the clusters online.

The aim of the last two sections was to motivate as comprehensive a topic taxonomy as possible and to annotate it with sample texts found by the clusterer. This semi-automatically generated data set serves as input for a text classifier.

The Classifier

Since the training data was generated semi-automatically and therefore cannot be assumed to be 100 percent correct, a robust classification method was chosen: the naive Bayes classifier.
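The naive Bayes idea can be sketched in a few lines of Python (illustrative only; the project used the RAINBOW toolkit): each category receives a log prior plus add-one-smoothed log likelihoods of the document's words, and the highest-scoring category wins. Class and variable names and the toy training data are assumptions made for this sketch.

```python
from collections import Counter, defaultdict
from math import log

class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.label_counts = Counter(labels)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        words = doc.lower().split()
        n_docs = sum(self.label_counts.values())
        best, best_lp = None, float("-inf")
        for label, n in self.label_counts.items():
            lp = log(n / n_docs)  # log prior of the category
            total = sum(self.word_counts[label].values())
            for w in words:
                # add-one smoothing keeps unseen words from zeroing the score
                lp += log((self.word_counts[label][w] + 1)
                          / (total + len(self.vocab)))
            if lp > best_lp:
                best, best_lp = label, lp
        return best

nb = NaiveBayes().fit(
    ["goal match striker", "match goal referee", "bank rates inflation"],
    ["sport", "sport", "economy"],
)
print(nb.predict("the striker scored a goal"))  # -> sport
```

The robustness mentioned above comes from the smoothing and the probabilistic averaging over all words: a few mislabelled training documents shift the word statistics only slightly.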

For the evaluation, precision (the proportion of correct clustering and classification results relative to all results returned) and recall (the proportion of correct classification results relative to an externally classified data set) were calculated. These data sets were drawn by random selection and comprised 30 documents per category. For precision, three samples were taken: from the training data, from data of years for which training data is available, and from data of years for which no training data is available.

The results are available online as:

More results

  • Keyword extraction:

A keyword here means a term that occurs significantly often in connection with a topic. Keywords therefore allow conclusions about the content of text sets “at a glance”.

Here is a tabular overview of the 10 most important words per category, determined by the χ² test (chi-square test).
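The χ² ranking can be sketched as follows, assuming the usual 2×2 contingency table per (term, category) pair: documents inside versus outside the category, containing versus not containing the term. Corpus contents and function names are illustrative assumptions.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 contingency table:
    n11 = in-category docs containing the term, n10 = in-category without it,
    n01 = out-of-category docs with the term, n00 = out-of-category without it."""
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    return num / den if den else 0.0

def top_keywords(docs_by_cat, category, k=10):
    """Rank the category's terms by chi-square against all other categories."""
    in_cat = docs_by_cat[category]
    out = [d for c, ds in docs_by_cat.items() if c != category for d in ds]
    scores = {}
    for w in {w for d in in_cat for w in d.lower().split()}:
        n11 = sum(1 for d in in_cat if w in d.lower().split())
        n10 = len(in_cat) - n11
        n01 = sum(1 for d in out if w in d.lower().split())
        n00 = len(out) - n01
        scores[w] = chi_square(n11, n10, n01, n00)
    return sorted(scores, key=scores.get, reverse=True)[:k]

corpus = {
    "sport": ["the goal decided the match", "a late goal won the match"],
    "economy": ["the bank raised the rates", "rates worry the bank"],
}
# terms like "goal" and "match" rank highest for "sport"; shared words
# like "the" score zero because they occur equally in both categories
print(top_keywords(corpus, "sport", k=2))
```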

  • Text filtering:

Newspapers, which make up a large part of the corpora, contain a great deal of linguistically uninteresting material such as league tables, stock exchange prices, and event announcements. To prevent these undesirable documents from appearing, corresponding training data were specified in analogy to the procedure described above, and documents assigned to this class are filtered out. For the evaluation, in addition to precision, the proportion of documents that were incorrectly not categorised as “data rubbish” was also calculated; it is noted in the bottom line of the precision table.
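As an illustrative stand-in (the project trained a dedicated “data rubbish” class rather than using a rule), a simple heuristic can already separate number-heavy material such as league tables from running prose; the function name and threshold are assumptions for this sketch.

```python
def looks_like_data_rubbish(text, digit_ratio=0.2):
    """Heuristic stand-in for the trained filter class: league tables and
    stock listings are number-heavy, so flag texts where digits make up a
    large share of the non-whitespace characters."""
    chars = [c for c in text if not c.isspace()]
    digits = sum(c.isdigit() for c in chars)
    return bool(chars) and digits / len(chars) >= digit_ratio

docs = [
    "Bayern 34 21 8 5 71:33 71",
    "The minister announced a new cultural programme yesterday.",
]
kept = [d for d in docs if not looks_like_data_rubbish(d)]
print(kept)  # only the prose document survives the filter
```

A trained classifier generalises far better than such a threshold, which is why the project reused the classification machinery for filtering instead.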

External links

Used software

Only programs with an open-source licence were used for the subproject. These are:

  • Clusterer: CLUTO
  • Classifier: RAINBOW
  • Indexer: MG (Managing Gigabytes)

Contact: corpuslinguistics@ids-mannheim.de
