
White Paper: Automatic Genre Identification – Testing with Noise



by Efstathios Stamatatos, Serge Sharoff, Marina Santini – Copyright © 2012, All rights reserved.

 

Citation:  Stamatatos E., Sharoff S., Santini M. (2012). Automatic Genre Identification – Testing with Noise. [White paper]. Retrieved from http://www.forum.santini.se/2012/03/white-paper-automatic-genre-identification-testing-with-noise/

The genre collections used in the experiments are available here. The reference list is here.

In the experiments described below, genre classes coming from three genre collections have been used: Santinis7 (Santini, 2007), KI-04 (Meyer zu Eissen and Stein, 2004), and HGC (Stubbe and Ringlstetter, 2007). These genre collections have been created by different people, in different universities, for different purposes, with different criteria, and with different notions of what genre is. Since genre is a complex concept and genre classes can be characterized in different ways, we assume that having an AGI algorithm that can cope with this diversity is much more useful in the real world than experimenting with a single, artificially consistent genre collection. Regardless of their different origins and assumptions, some of the genres can be mapped onto each other. For instance, private homepages (PHP) in Santinis7 and portrayal-private in KI-04 are considered to belong to the same genre class, based on the descriptions that the collections' builders have provided.

The genre collections have been divided into several sets (Set1, Set2 and Set3) in order to assess how classifiers react to differing quantities of noise. Moreover, we split Set1 into two subsets: Set1.1, containing 90% of the texts per genre, and Set1.2, containing the remaining 10% per genre (a sketch of such a split is given below). Set3 includes genres that do not occur in Set1 and Set2, i.e. Frontpage and Editorial from Santinis7; Article, Download and Portrayal-Non Private from KI-04; and Article, Commentary, Interview, News, Poetry, Recipe, Scientific from HGC. A breakdown of the sets, the collections, the genres and the number of documents per genre is shown in Table 1.
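For concreteness, the per-genre split can be sketched in Python as follows. This is our own illustration (the paper does not give the exact splitting procedure), and the function name is hypothetical:

```python
import random

def split_per_genre(docs_by_genre, ratio=0.9, seed=0):
    """Split each genre's documents into two subsets, e.g. Set1.1 (90%)
    and Set1.2 (10%), preserving the per-genre proportions."""
    rng = random.Random(seed)
    part_a, part_b = {}, {}
    for genre, docs in docs_by_genre.items():
        shuffled = docs[:]
        rng.shuffle(shuffled)
        k = int(len(shuffled) * ratio)
        part_a[genre], part_b[genre] = shuffled[:k], shuffled[k:]
    return part_a, part_b
```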

Note that DISC, LIST, and PHPS correspond to Discussion, Listing, and Private Home Pages, respectively.

Method and Features

A number of recent studies have demonstrated the effectiveness of character n-grams in AGI, e.g., (Kanaris and Stamatatos, 2009; Mason et al., 2009). This kind of feature is able to capture a variety of stylistic information. Moreover, their extraction from text is simple and accurate for any natural language. For all these reasons, these experiments are based on a character n-gram representation. The classification model we used is the Common N-Grams (CNG) method described in (Keselj et al., 2003). In short, CNG applies a cumulative approach to the representation of each class. That is, the training texts of each genre are concatenated into a single file, and a profile is extracted from this file to represent the genre's properties. The profile is defined as the L most frequent character n-grams found in the concatenated file. Similarly, for a text of unknown genre belonging to the evaluation set, a profile of its L most frequent n-grams is extracted. In these experiments, we used L = 5,000 and n = 3 based on preliminary experiments. In other words, the profile of a text consists of the 5,000 most frequent character 3-grams in that text.
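A minimal Python sketch of the profile extraction (our own illustration, not the authors' code; `extract_profile` is a hypothetical helper name):

```python
from collections import Counter

def extract_profile(text, n=3, L=5000):
    """Return the L most frequent character n-grams of `text`,
    mapped to their relative frequencies."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.most_common(L)}

# A genre profile is built the same way, from the concatenation of all
# training texts of that genre:
# genre_profile = extract_profile("".join(texts_of_genre))
```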

It has to be noted that character n-grams can capture any kind of information, including topic, with the value of n being very important: short n-grams (n < 4) are less likely to capture topical information. The number of features in the method used in our experiments is independent of the number of documents: even with 1 million documents, we would still select exactly 5,000 features.

The similarity between a text x and a genre y is calculated based on the following distance metric (Keselj et al., 2003):

d(x, y) = \sum_{g \in Pr(x) \cup Pr(y)} \left( \frac{2\,(f_x(g) - f_y(g))}{f_x(g) + f_y(g)} \right)^2

where Pr(x) and Pr(y) are the profiles of x and y (the latter extracted from the concatenation of all training texts of that particular genre), while f_x(g) and f_y(g) are the relative frequencies of the n-gram g in these profiles. Finally, the most likely genre is the one with the smallest distance from x:

genre(x) = \arg\min_{y \in G} d(x, y)

where G is the genre palette of the training corpus. The CNG method is very easy to follow and requires minimal training time (actually, training merely comprises the extraction of genre profiles). A variation of this method (based on a slightly different definition of profiles) has been used in AGI with very good results (Mason et al., 2009). Moreover, since it is a similarity-based method, it enables us to build open-set classifiers, that is, classifiers that do not necessarily assign a text to a genre of the training corpus but may assign some texts to the IDONTKNOW class (a real-world scenario). This decision depends on the smallest distance of an unknown text from all the genres. To be able to define a threshold, we first need a normalization of the distance metric. Here, we used the following metric:

d_1(x, y) = \frac{d(x, y)}{4\,|Pr(x) \cup Pr(y)|}

This normalized distance ensures that the calculated distance between two text profiles is between 0 and 1 inclusive, since each term of the sum in d(x, y) is at most 4. For the open-set classification scheme, a threshold is defined as follows: given that D1 is the shortest distance of a given document from all the genre profiles of the training set and D2 is the second shortest distance, the criterion is:

D_2 - D_1 \geq \varepsilon

that is, each document satisfying this criterion is assigned to the genre corresponding to D1; otherwise, it is assigned to the IDONTKNOW class. A problem can arise when only a few documents are available for a genre in the training set (Stamatatos, 2007): imbalanced profiles are produced (for some genres we may not find 5,000 trigrams) and the distance measure becomes unstable (documents tend to be classified to the shortest profile). However, this is not a likely situation in AGI, since there are usually plenty of documents available for each genre.
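Putting the formulas above into code, here is a minimal sketch of the distance computation and of the closed- and open-set decisions, building on `extract_profile`. This is our own illustration under the reconstruction above, not the authors' implementation:

```python
def cng_distance(px, py):
    """CNG dissimilarity (Keselj et al., 2003) between two profiles,
    i.e. dicts mapping n-grams to relative frequencies."""
    total = 0.0
    for g in px.keys() | py.keys():
        fx, fy = px.get(g, 0.0), py.get(g, 0.0)
        total += (2 * (fx - fy) / (fx + fy)) ** 2
    return total

def norm_distance(px, py):
    # Each summand above is at most 4, so dividing by 4 * |union|
    # bounds the distance to [0, 1].
    union = px.keys() | py.keys()
    return cng_distance(px, py) / (4 * len(union)) if union else 0.0

def classify_closed_set(px, genre_profiles):
    """Assign the text profile px to the closest genre."""
    return min(genre_profiles, key=lambda g: norm_distance(px, genre_profiles[g]))

def classify_open_set(px, genre_profiles, eps=0.002):
    """Assign px to the closest genre only if the margin between the two
    closest genres is at least eps; otherwise answer IDONTKNOW."""
    dists = sorted((norm_distance(px, p), g) for g, p in genre_profiles.items())
    (d1, best), (d2, _) = dists[0], dists[1]
    return best if d2 - d1 >= eps else "IDONTKNOW"
```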

 

Experiments

In this section, we describe a series of experiments based on various combinations of noise and classification schemes. We denote the two types of noise as N1 and N2: N1 is the case where the training and test samples come from different sources/annotators (non-homogeneous sets), while N2 is the case where the test set contains genre classes that are not present in the training set (unknown genres). In addition, the closed-set classification scheme (each document must be assigned to one genre from the training genre palette) is denoted as CS, while OS stands for the open-set classification scheme (the IDONTKNOW answer is valid).

Experiment 1 (Noise-free, CS): In the first experiment we used Set1.1 for training and Set1.2 for testing. This is the traditional ML approach for evaluating classifiers and has been followed extensively in AGI so far. Although most previous experiments use cross-validation instead of a single test set, our results are in line with cross-validated experiments. We prefer a specific test set here to allow direct comparisons with other experiments using exactly the same test set.

Experiment 2 (Noise-free, CS): A mixed training set was formed, comprising samples of genres taken from different collections (but without intersection between the genre classes of the three collections). Most of Set1+Set2 was used for training, while for testing we employed 20 texts per genre (10 texts from Set1 + 10 texts from Set2, not used in training).

Experiment 3 (N1, CS): The first type of noise is introduced. That is, the training and test samples for the same genres were taken from different collections. In detail, we used Set1.1 for training and Set2 for testing.

Experiment 4 (N1, CS): We used Set2 for training and Set1 for testing (i.e. the opposite of experiment 3).

Experiment 5 (N1, N2, CS): In this experiment, there are many documents that do not belong to any genre of the training corpus. In detail, we used Set1.1 for training and Set2 plus Set3 for testing in a closed-set classification scheme, that is, all the texts of Set3 had to be assigned to a genre of Set1.

Experiment 6 (N1,N2,OS): This experiment also examines both types of noise (we used the same training and test sets as in experiment 5), but in the framework of an open-set classification scheme that enables some documents to be assigned to the IDONTKNOW class. This is the most realistic scenario.
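As a rough sketch of how any of these experiments can be wired together, reusing the helpers from the earlier sketches (the wrapper itself and its parameter names are our own):

```python
def run_experiment(train_docs_by_genre, test_docs, eps=None):
    """Build one CNG profile per training genre, then classify each test
    document: closed-set when eps is None, open-set otherwise."""
    profiles = {genre: extract_profile("".join(docs))
                for genre, docs in train_docs_by_genre.items()}
    preds = []
    for text in test_docs:
        px = extract_profile(text)
        if eps is None:
            preds.append(classify_closed_set(px, profiles))
        else:
            preds.append(classify_open_set(px, profiles, eps))
    return preds
```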

Results

Table 2 shows the performance of the six experiments in terms of Precision, Recall and F-measure.

Table 2: Results for the six experiments

  1. The results of Experiment 1 are good and close to those reported in other papers. Previous experiments suggest that cross-validation rather than a single test set yields somewhat more favourable figures. This experiment confirms that when we set up a closed-set scenario with a noise-free evaluation set, the performance of AGI is high.
  2. In Experiment 2 the results are less accurate than in Experiment 1, but not discouraging. This is reasonable because the training and test sets are not fully homogeneous, since they contain a mix of collections covering the same genres. This shows that different collections represent different aspects of the same genre.
  3. Experiment 3 shows that with a more realistic evaluation, accuracy drops dramatically in both recall and precision. The representation of genres in a given collection is not adequate to capture the properties of the same genres in another collection.
  4. Experiment 4 confirms the above conclusion. Note that there is a decrease in performance compared to Experiment 3. It should be stressed that the test set is now bigger, so errors are more likely to happen. In addition, Set1 is based mainly on the Santinis7 collection, while Set2 is based mainly on the KI-04 collection. Previous studies have shown that AGI on KI-04 is harder than AGI on Santinis7 (Kanaris and Stamatatos, 2009).
  5. In Experiment 5, another realistic condition is considered: now there are many documents that do not belong to any genre of the training corpus. Note that the incorporation of Set3 into the evaluation set affects precision only, since recall has to do with the documents of the evaluation set belonging to the known genres. It has to be stressed that as the number of documents belonging to unknown genres increases, precision decreases proportionally. Up to this point, all the experiments were based on closed-set classification, i.e. a genre must be assigned to each document.
  6. In Experiment 6, we used the same training and test sets as in Experiment 5, but within an open-set classification scheme. Precision is now improved, and only slightly worse than in Experiment 3. This shows that precision can remain high despite the incorporation of unknown genres in the evaluation corpus. On the other hand, recall is worse than in Experiment 5.

Table 3: Distance measurement thresholds

The threshold ε = 0.002 of the open-set classifier was derived empirically among several candidates, based on preliminary experiments, so as to maximize the F-score; see Table 3 (ε = 0.0000 corresponds to the closed-set classifier). By increasing the threshold value it is possible to further increase precision, at the cost of further reducing recall.
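This empirical selection of ε can be reproduced with a simple grid search over candidate values on held-out data. A sketch reusing the earlier helpers; the candidate list below is hypothetical:

```python
from sklearn.metrics import f1_score

def pick_threshold(candidates, val_texts, val_labels, genre_profiles):
    """Return the eps value that maximizes macro F1 on a validation set."""
    def score(eps):
        preds = [classify_open_set(extract_profile(t), genre_profiles, eps)
                 for t in val_texts]
        return f1_score(val_labels, preds, average="macro")
    return max(candidates, key=score)

# e.g. pick_threshold([0.0, 0.001, 0.002, 0.005, 0.01],
#                     val_texts, val_labels, genre_profiles)
```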

Discussion

The experiments described in this white paper show that AGI research needs to approximate more realistic scenarios. For a number of years, AGI has been tested on individual small genre collections. This controlled scenario returns good results (up to 97% accuracy), but it is basically pointless in practical terms, since we are so far unable to approximate, or even hypothesize about, the distribution and proportion of genres on the web or in any other large digital environment.

Experiments 1-5 show that in a closed-set condition, performance decreases in proportion to the amount of noise introduced. The performance achieved in the traditional ML setting of Experiment 1 cannot be achieved in Experiments 2-5: the F-measure regularly drops by more than 10 points per experiment. N1 affects both recall and precision, while N2 affects only precision. Both types of noise are likely to occur in every realistic application of AGI technology.

Experiments 5 and 6 represent, in our view, the most realistic scenario of this set of experiments, including both types of noise. This scenario depicts a situation where a limited number of classes must be classified against a larger number of classes that are not represented in the training set. The test set is three times larger than the training set, and the number of unknown classes is double the number of those represented in the training set. The performance is in line with other experiments with noise (anonymised) and with the preliminary assessment carried out on WEGA (Santini and Rosso, 2008). Experiment 5 shows that the closed-set classification scheme is not appropriate for AGI, since the precision of the classifier drops in proportion to the number of documents from unknown genres. Experiment 6 shows that an open-set classifier is able to maintain reasonable precision despite the presence of unknown genres.

One obvious comment on these results concerns the features: the features employed do not discriminate the IDONTKNOW class. Also, character n-grams can include some topic-specific keywords, thus distorting the picture by reducing genre classification to a topic classification task. On the other hand, the POS trigrams used in other studies are less prone to overfitting with respect to topics, but their performance is well below what can be achieved with character n-grams.

From another point of view, this type of experiment can be used to evaluate the quality of existing genre collections. The results of Experiments 3 and 6 show that in a real-world application the genre labels DISC (from KI-04), SHOP and PHPS (from Santinis7) are reliable, since they maintain relatively high precision and reasonable recall. On the other hand, the genre label FAQS (from Santinis7) is too specific, and therefore of little use, since it achieves high precision but very low recall. Furthermore, the genres LIST and BLOG (from Santinis7) are too general and unreliable, since their precision is affected dramatically by the presence of unknown genres in the evaluation corpus.

Efstathios Stamatatos, Serge Sharoff, Marina Santini, Copyright © 2012, All rights reserved.

