Text categorization (a.k.a. text classification) is the task of assigning predefined categories to free-text documents. It can provide conceptual views of document collections and has important applications in the real world. For example, news stories are typically organized by subject categories (topics) or geographical codes; academic papers are often classified by technical domains and sub-domains; patient reports in health-care organizations are often indexed from multiple aspects, using taxonomies of disease categories, types of surgical procedures, insurance reimbursement codes and so on. Another widespread application of text categorization is spam filtering, where email messages are classified into the two categories of spam and non-spam.
Instead of manually classifying documents or hand-crafting automatic classification rules, statistical text categorization uses machine learning methods to learn automatic classification rules based on human-labeled training documents.
A free-text document is typically represented as a feature vector \(x= (x^{(1)}, \ldots, x^{(p)}) \ ,\) where the feature values \( x^{(i)} \) typically encode the presence of words, word n-grams, syntactically or semantically tagged phrases, Named Entities (e.g., people or organization names), etc. in the document. A standard method for computing the feature values \( x^{(i)} \) for a particular document \( d \) is the bag-of-words approach. A specific version of this approach is the TF-IDF term weighting scheme. In its simplest form, \( x^{(i)} = TF(i,d) \cdot IDF(i) \ ,\) where TF(i,d) (term frequency) is the number of times term \( i \) occurs in document \( d \ .\) IDF(i) \( = \log\frac{N}{n_i} \) is the Inverse Document Frequency, where \( N \) is the total number of documents in the collection, and \( n_i \) is the number of documents that contain term \( i \ .\) To make documents of different lengths comparable, each feature vector \(x\) is typically normalized to Euclidean length 1, dividing each feature value by the Euclidean length \(||x||=\sqrt{x \cdot x}\) of the original vector. Note that the resulting feature vectors are typically high-dimensional (\(10^{4}\) to \(10^{5}\) features) but sparse (\(10^{2}\) to \(10^{3}\) non-zero coordinates), which favors methods that exploit sparsity. Sometimes feature selection methods (e.g. selecting the features with highest document frequency or mutual information) are used to reduce dimensionality.
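To make the feature construction concrete, the following sketch (in Python; the function names compute_idf and tfidf_vector are our own, not from any particular library) computes length-normalized TF-IDF vectors using a sparse dictionary representation, which directly exploits the sparsity noted above.

```python
import math
from collections import Counter

def compute_idf(corpus_tokens):
    """IDF(i) = log(N / n_i), where n_i is the number of documents
    containing term i. corpus_tokens: list of token lists, one per document."""
    N = len(corpus_tokens)
    df = Counter(t for doc in corpus_tokens for t in set(doc))   # n_i
    return {t: math.log(N / n) for t, n in df.items()}

def tfidf_vector(doc_tokens, idf):
    """Normalized TF-IDF feature vector for one document,
    returned as a sparse dict {term: weight}."""
    tf = Counter(doc_tokens)                                  # TF(i, d)
    x = {t: tf[t] * idf[t] for t in tf if t in idf}           # TF * IDF
    norm = math.sqrt(sum(v * v for v in x.values()))          # Euclidean length ||x||
    return {t: v / norm for t, v in x.items()} if norm > 0 else x
```

A real system would additionally tokenize, lowercase and possibly stem the text, and might restrict the vocabulary via the feature selection methods mentioned above.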
It is useful to differentiate text classification problems by the number of classes a document can belong to. If there are exactly two classes (e.g. spam / non-spam), this is called a “binary” text classification problem. If there are more than two classes (e.g. positive / neutral / negative) and each document falls into exactly one class, this is a “multi-class” problem. In many cases, however, a document may have more than one associated category in a classification scheme, e.g., a journal article could belong to computational biology, machine learning and some sub-domains of both. This type of text classification task is called a “multi-label” categorization problem. Multi-label and multi-class tasks are often handled by reducing them to \(k\) binary classification tasks, one for each category. For each such binary classification task, members of the respective category are designated as positive examples, while all others are designated as negative examples. We will therefore focus on binary classification in the following.
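Assuming labels are given as a set of categories per document, a minimal sketch of this one-vs-rest reduction might look as follows (one_vs_rest_labels and train_binary are illustrative names, not taken from the literature cited here):

```python
def one_vs_rest_labels(doc_labels, category):
    """Binary labels for one category: documents carrying `category`
    become positive (+1), all other documents negative (-1).
    doc_labels: list of label sets, one per document."""
    return [+1 if category in labels else -1 for labels in doc_labels]

# Hypothetical usage: train one binary classifier per category.
# for c in categories:
#     y_c = one_vs_rest_labels(doc_labels, c)
#     classifiers[c] = train_binary(X, y_c)   # any binary learner
```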
A training sample of classified documents for a binary classification task can be denoted as \(S = \{ ( x_1 , y_1 ), \ldots, ( x_n , y_n ) \} \ ,\) where \( n \) is the number of training examples and \(y_i \in \{-1,+1\}\) indicates the class label of the respective document. Using this training sample, a supervised learning algorithm aims to find the optimal classification rule, i.e., a function mapping from the p-dimensional feature space to the one-dimensional class label. Optimal generally means that the classification rule both accurately classifies the training documents and generalizes well to new documents beyond the training set. More formally, the training sample \(S\) is typically assumed to be an independently and identically distributed sample from some unknown probability distribution \(P(x,y)\) that models the creation process of documents and labels. The generalization accuracy of a classification rule \(h\) can then be characterized by the “risk” \(R(h) = \int L(h(x),y) \, d P(x,y)\ ,\) where \(L(h(x),y)\) is a loss function that returns a numerical score indicating how bad it is if the rule predicts \(h(x)\) while the true label is \(y\ .\) A commonly used loss function is the 0/1-loss, which returns 0 if the prediction matches the true label, and 1 otherwise. In this case, the risk equals the probability that the classification rule misclassifies an example.
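As a small illustration, here is a sketch of the 0/1-loss and of the average loss of a rule on a labeled sample (the empirical risk that reappears below); the function names are our own:

```python
def zero_one_loss(prediction, label):
    """0/1-loss: 0 if the prediction matches the true label, 1 otherwise."""
    return 0 if prediction == label else 1

def average_loss(h, sample, loss=zero_one_loss):
    """Average loss of a classification rule h on a labeled sample
    S = [(x_1, y_1), ..., (x_n, y_n)]; with the 0/1-loss this is the
    fraction of misclassified examples."""
    return sum(loss(h(x), y) for x, y in sample) / len(sample)
```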
While decision trees, logical rules, and instance-based methods have been explored for text classification, the most commonly used type of classification rule is the linear rule. Linear rules take the form \(h_\theta(x) = sign(\theta^{(1)} x^{(1)} + \ldots + \theta^{(p)} x^{(p)}) = sign(\theta \cdot x)\ ,\) where \(\theta\) is a weight vector with one weight for each feature. Often an additional feature that always takes the value 1 is added to simulate a constant offset. \(sign(\cdot)\) is the sign function, returning +1 for positive arguments and -1 for non-positive arguments. Geometrically, a linear rule corresponds to a hyperplane in the vector space of feature vectors, classifying documents by which side of the hyperplane they fall on.
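A minimal sketch of such a linear rule, reusing the sparse dictionary vectors from the TF-IDF sketch above, might look as follows (the names are illustrative):

```python
def linear_rule(theta, x, bias=0.0):
    """h_theta(x) = sign(theta . x + bias) for sparse dict vectors;
    bias plays the role of the constant-offset feature."""
    score = sum(theta.get(f, 0.0) * v for f, v in x.items())  # theta . x
    return +1 if score + bias > 0 else -1                     # sign(.)
```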
Representative examples of linear classifiers include linear Support Vector Machines (SVM), regularized logistic regression, ridge regression (i.e., regularized least squares fit), Naïve Bayes methods and boosted linear classifiers (Yang, 1992, 1994 and 1995; Lewis, 1994; Joachims, 1998, 2002; McCallum and Nigam, 1998; Schapire & Singer, 2000; Zhang and Oles, 2000; Li and Yang, 2003). It has been empirically observed that linear classifiers with proper regularization are often sufficient for solving practical text categorization problems, with performance comparable to or better than that of non-linear classifiers. Furthermore, linear methods are generally computationally efficient, both at training and at classification time.
Naïve Bayes is a generative learning method that learns a model of the distribution \(P(x,y)\) from the training sample \(S\ .\) Naïve Bayes is “naïve” in that it assumes conditional independence between all feature values in a feature vector. While this assumption is clearly violated in text classification, Naïve Bayes nevertheless produces fairly accurate classification rules in many cases.
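As one illustrative variant (not necessarily the formulation used in the cited studies), the following sketch implements a multinomial Naïve Bayes learner with Laplace smoothing over bag-of-words counts:

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Multinomial Naive Bayes with Laplace smoothing (alpha).
    docs: list of token lists; labels: list of class labels (e.g. -1/+1).
    Returns log class priors and per-class log term probabilities."""
    vocab = {t for d in docs for t in d}
    counts = defaultdict(Counter)                 # term counts per class
    class_n = Counter(labels)                     # documents per class
    for d, y in zip(docs, labels):
        counts[y].update(d)
    log_prior = {y: math.log(n / len(docs)) for y, n in class_n.items()}
    log_prob = {}
    for y in class_n:
        total = sum(counts[y].values())           # total term count in class y
        log_prob[y] = {t: math.log((counts[y][t] + alpha) /
                                   (total + alpha * len(vocab)))
                       for t in vocab}
    return log_prior, log_prob

def predict_nb(doc_tokens, log_prior, log_prob):
    """Pick the class maximizing log P(y) + sum over tokens of log P(t | y);
    tokens outside the training vocabulary are ignored."""
    def score(y):
        return log_prior[y] + sum(log_prob[y].get(t, 0.0) for t in doc_tokens)
    return max(log_prior, key=score)
```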
The other methods mentioned above are discriminative learning methods. They do not build a model of \(P(x,y)\ ,\) but directly search the hypothesis space \(H\) (e.g. the set of all linear rules) for a classification rule that has low training error (i.e. empirical risk) \[R_{train}(h) = \frac{1}{n} \sum_{i=1}^{n} L(h(x_i),y_i)\ .\] The training objective of many of these methods (e.g. SVMs, ridge regression, logistic regression) can be brought into the following form: \[\tag{1} \hat{\theta}= \arg\min_{\theta } \{ \sum_{i = 1}^n L_\theta(x_i, y_i) +\lambda G(\theta)\},\quad \lambda \ge 0 \ .\]
The term \(G(\theta)\) is called the “regularization term”, measuring the complexity of the rule \(h_\theta\ .\) Exact definitions of the empirical risk term and the regularization term differ among classification methods. Generally speaking, the two terms tend to be negatively correlated: a low-complexity model often has high empirical risk (as a result of under-fitting the data), while a high-complexity model tends to overfit the data (i.e. it has low empirical risk but high prediction error). The parameter \(\lambda\) balances empirical risk against model complexity. The optimal value of \(\lambda\) is typically determined via cross-validation.
To analyze the differences and similarities among popular classifiers, let us look at three of the more successful linear classification methods in text categorization as examples: linear SVM, ridge regression and regularized logistic regression. Interestingly, the three methods have the same regularization term, i.e., \(G(\theta) = ||\theta||^2=\theta \cdot \theta\ .\) Thus the differences among these methods only come from their loss functions. The linear SVM uses a hinge loss defined as: \[\tag{2} L^{svm}_\theta(x_i, y_i) = \max (1- y_i x_i\theta, 0)\ .\]
Ridge regression (and the linear least squares fit) uses the squared loss: \[\tag{3} L^{square}_\theta(x_i, y_i) = (x_i \theta - y_i)^2\ .\]
The logistic regression method uses the estimated conditional probability of the target label given the input to measure the empirical loss: \[\tag{4} L^{lr}_\theta(x_i, y_i)=\log (1+e^{-y_i x_i \theta }) \ .\]
Although these loss functions have different theoretical properties (Hastie et al., 2001) and require different algorithms for training (Joachims, 2002; Zhang and Oles, 2000), empirical evaluations of these methods on benchmark datasets have shown similar performance when the trade-off between empirical risk and regularization was properly tuned through cross-validation (Li & Yang, 2003). Those experiments also showed that removing the regularization term from the objective functions of these classifiers resulted in significant performance degradation.
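To make the comparison concrete, the following sketch spells out the three loss functions and the common objective of Equation (1) for sparse dictionary vectors; the code is illustrative only and ignores numerical issues such as overflow in the exponential:

```python
import math

def hinge_loss(score, y):        # SVM, Eq. (2): max(1 - y * (x . theta), 0)
    return max(1.0 - y * score, 0.0)

def squared_loss(score, y):      # ridge regression, Eq. (3): (x . theta - y)^2
    return (score - y) ** 2

def logistic_loss(score, y):     # logistic regression, Eq. (4)
    return math.log(1.0 + math.exp(-y * score))

def regularized_objective(theta, sample, loss, lam):
    """Objective of Eq. (1): sum of losses over the training sample
    plus lambda * ||theta||^2, for sparse dict vectors."""
    def dot(a, b):
        return sum(a.get(f, 0.0) * v for f, v in b.items())
    empirical = sum(loss(dot(theta, x), y) for x, y in sample)
    return empirical + lam * dot(theta, theta)
```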
Non-linear classifiers have also been successfully applied to text categorization, including k-nearest neighbor methods (kNN), SVMs with non-linear kernels, Boosting, decision trees, and neural networks with hidden layers. Like SVMs with non-linear kernels, Boosting (Schapire & Singer, 2000) can be viewed as a linear method after a non-linear feature transformation \(\phi\) that depends on the class of base learners, with \(\lambda=0\) and loss \( L^{exp}_\theta(x_i, y_i)=e^{-y_i \phi(x_i) \theta} \ .\) Empirical evaluations have shown the performance of these non-linear classifiers to be comparable to that of the stronger linear classifiers (Lewis, 1994; Yang, 1994, 1999; Wiener et al., 1995; Joachims, 1998; Li & Yang, 2003). Among these, the kNN approach is the most common due to its simplicity and a prediction accuracy that is frequently competitive with regularized linear classifiers. Known as an instance-based or lazy learning method, kNN uses the k nearest neighbors of each new document in the training set to estimate the local likelihood of each category for the new document. The model complexity is controlled by the choice of k. Formally, the degrees of freedom of a kNN classification model can be defined as \( d.f. = \frac{n}{k}\ ,\) where n is the number of documents in the training set. When k = 1, kNN has the maximum model complexity and tends to overfit the training set; as k increases, the model complexity decreases accordingly. Just as in the case of linear classifiers, it is important to find the right trade-off between empirical risk and model complexity for non-linear classifiers as well.
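A minimal sketch of kNN classification for the binary case, assuming a user-supplied similarity function (e.g. the dot product of length-normalized TF-IDF vectors, i.e. cosine similarity), might look as follows:

```python
def knn_classify(x, training_sample, k, similarity):
    """Majority vote among the k most similar training documents.
    training_sample: list of (feature_vector, label) pairs with labels -1/+1;
    ties are broken toward -1 for simplicity."""
    neighbors = sorted(training_sample,
                       key=lambda pair: similarity(x, pair[0]),
                       reverse=True)[:k]
    votes = sum(y for _, y in neighbors)
    return +1 if votes > 0 else -1
```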
Text classification rules are typically evaluated using performance measures from information retrieval. Common metrics for text categorization evaluation include recall, precision, accuracy, error rate and F1. Given a test set of N documents, a two-by-two contingency table with four cells can be constructed for each binary classification problem. The cells contain the counts of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN), respectively. Clearly, N = TP + FP + TN + FN. The metrics for binary decisions are defined as: \[recall = \frac{TP}{TP+FN}, \quad precision = \frac{TP}{TP+FP}, \quad accuracy = \frac{TP+TN}{N}, \quad error = \frac{FP+FN}{N}, \quad F1 = \frac{2 \cdot precision \cdot recall}{precision + recall}\ .\]
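For illustration, these metrics can be computed from the four contingency counts as in the following sketch (binary_metrics is our own name; zero denominators are mapped to 0 for simplicity):

```python
def binary_metrics(TP, FP, TN, FN):
    """Recall, precision, accuracy, error and F1 from a 2x2 contingency table."""
    N = TP + FP + TN + FN
    recall = TP / (TP + FN) if TP + FN else 0.0
    precision = TP / (TP + FP) if TP + FP else 0.0
    accuracy = (TP + TN) / N
    error = (FP + FN) / N
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"recall": recall, "precision": precision,
            "accuracy": accuracy, "error": error, "F1": f1}
```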
Note that due to the often highly unbalanced number of positive vs. negative examples, TN often dominates the accuracy and error of a system, which can lead to misinterpretation of the results. For example, when the positive examples of a category constitute only 1% of the entire test set, a trivial classifier that makes negative predictions for all documents has an accuracy of 99%, or an error of 1%. However, such a system is useless. For this reason, recall, precision and F1 are more commonly used than accuracy and error in text categorization evaluations.
In multi-label classification, the simplest method for computing an aggregate score across categories is to average the scores of all binary tasks. The resulting scores are called macro-averaged recall, precision, F1, etc. Another way of averaging is to first sum TP, FP, TN, FN and N over all categories, and then compute each of the above metrics from the pooled counts. The resulting scores are called micro-averaged. Macro-averaging gives an equal weight to each category, and is often dominated by the system’s performance on rare categories (which form the majority in a power-law-like distribution). Micro-averaging gives an equal weight to each document, and is often dominated by the system’s performance on the most common categories. The two ways of measuring performance are complementary, and both are informative.
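Assuming the binary_metrics sketch above, macro- and micro-averaging over a list of per-category contingency tables can be sketched as follows:

```python
def macro_average(tables):
    """tables: list of per-category (TP, FP, TN, FN) tuples.
    Macro-averaging: compute the metrics per category, then average them."""
    per_cat = [binary_metrics(*t) for t in tables]
    return {m: sum(p[m] for p in per_cat) / len(per_cat) for m in per_cat[0]}

def micro_average(tables):
    """Micro-averaging: pool TP, FP, TN, FN over all categories first,
    then compute the metrics once from the pooled table."""
    pooled = [sum(col) for col in zip(*tables)]
    return binary_metrics(*pooled)
```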
Other evaluation metrics include utility functions defined over weighted TP, FP, TN and FN, and rank-based metrics like Mean Average Precision (MAP), which evaluates the ability of a system to rank categories for a given document. More details about evaluation metrics, statistical significance tests and benchmark datasets can be found in the literature (Yang and Liu, 1999; Sebastiani, 2002; Lewis et al., 2004).
The scale of real-world text classification applications, both in terms of the number of classes and the (often highly unbalanced) number of training examples, poses interesting research challenges. For example, the Yahoo! taxonomy (2004 version) for web page classification contains nearly 300,000 categories over a 16-level hierarchy (Liu et al., 2006), with a total of approximately 800,000 documents and a vocabulary of over 4 million unique words. Figure 1 shows the power-law relationship between the category size (in terms of the number of documents) on the X-axis and the number of categories of that size on the Y-axis. Fewer than 1% of the categories had more than 100 documents per category at the time the taxonomy was crawled in 2004. This means that if a statistical classifier requires approximately 100 positive training examples per category to learn a sufficiently accurate model, then we could solve the classification problem for only 1% of the categories even if we used the entire set of 800,000 pages as training examples. How to learn effectively from relatively sparse training examples is therefore crucial for the true success of text categorization methods, and it remains a challenging research topic. Recent work in multi-task learning, for example, focuses on how to leverage cross-category dependencies, how to “borrow” training examples among mutually dependent categories, and how to discover latent structures in a functional space for joint modeling of dependent categories (J. Zhang, 2006; T. Zhang, 2005, 2007).
When the number of categories reaches the magnitude of tens of thousands or higher, the conventional approach of using all the documents to train a two-way classifier per category is no longer computationally feasible. If the categories are arranged in a hierarchical taxonomy, a natural alternative is a divide-and-conquer approach that decomposes document classification into multi-step nested decisions along the taxonomy, with a locally optimized classifier per node (sub-domain of categories), each trained on a subset of the training data. Liu et al. (2005) successfully applied this strategy with SVM and kNN classifiers for the training and testing of 132,199 categories in the Yahoo! Web page taxonomy. They found that the divide-and-conquer strategy not only addressed the scaling issue but also yielded better classification performance, because the classifiers could be locally optimized for individual domains and sub-domains.
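A schematic sketch of such top-down classification along a taxonomy is given below; the node structure and the classify_children function are illustrative placeholders, not the actual implementation of Liu et al. (2005):

```python
def classify_top_down(x, node, classify_children):
    """Descend the taxonomy from `node`, letting a locally trained
    classifier pick a child (sub-domain) at each level, until a leaf
    category is reached; the leaf is returned as the prediction.
    node: taxonomy node with a .children list (empty at leaves).
    classify_children(x, node): returns the chosen child of `node`
    (e.g. an SVM or kNN classifier trained only on that node's data)."""
    while node.children:
        node = classify_children(x, node)
    return node
```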
Finally, there are interesting challenges in new applications of text classification. While much research has focused on classification into topic categories, other types of classes are interesting as well. For example, one might want to classify documents by sentiment (Pang & Lee, 2002): is a review positive or negative? Or one might want to classify a message as deceptive or not. While representations and methods developed for topic-based classification are applicable to these classification problems as well, special considerations reflecting the nature of the classification task will probably lead to improved performance.
Internal references
K-Nearest Neighbor, Naïve Bayes Classifier, Support Vector Machine, Supervised Learning