رده‌بندی متون

تغییرات پروژه از تاریخ 1393/03/05 تا تاریخ 1393/04/01
در رده‌بندی متون هدف این است که سندهایی را که در اختیار داریم بتوانیم برچسب‌گذاری موضوعی کنیم. در واقع این موضوع صرفا یک مسئله با ناظر است، یعنی مجموعه‌ای از اسناد متنی که گروه‌بندی موضوعی شده‌اند به عنوان داده‌ی آموزشی در اختیار سامانه قرار می‌گیرد تا بتواند با یادگیری از این مجموعه، اسناد جدید ورودی را به یکی از این گروه‌های موضوعی ملحق نماید.

در این پژوهش روش‌های مختلف رده‌بندی اسناد متنی مورد بررسی قرار گرفته و برای زبان فارسی پیاده‌سازی می‌شوند.

<hr />

[Github](https://github.com/s1na/magpie)

#Abstract#
TODO

#Introduction#
Data is being generated with a rapid rate and with it the need for better web search, spam filtering, content recommendation and etc., which are in the scope of document classification. Document classification is the problem of assigning one or more pre-specified categories to documents.
A typical document classification process consists of training and building a classifier from a training set and then it's given documents to classify. Depending on the training set available, two learning approaches are available:

* **Supervised learning**—This approach became popular and set aside Knowledge Extraction, and  it is the way to solve document classification problems in many domains. In this approach a large labeled set of examples is required to build the classifier.

    Many common classifiers are used such as Naive Bayes, Decision Trees, SVMs, KNNs, etc. Naive Bayes is shown to be a good classifier in term of accuracy and computational accuracy as well as a simple classifier.

* **Semi-supervised learning**—A recent trend is to benefit from both labeled data and unlabeled data. This approach which is somewhere between supervised learning and unsupervised learning is exciting for both theoretical and practical points of view.  Labeled data is difficult, expensive and time consuming to obtain, but unlabeled data is cheap and easy to obtain. In addition, results have shown that one could reach higher accuracy simply by adding unlabeled to their existing supervised classification system.

    Some often-used methods in semi-supervised classification include: EM with generative mixture models, self-training, co-training, transductive support vector machines and graph-based methods.

Hence one of the major decisions when trying to solve document classification problems is choosing the right approach and method. For example bad matching of problem structure with model assumption could lead to lesser accuracy in semi-supervised learning.

Another problem is the ever increasing number of features and variables in these methods. This rises the issue of finding good features which could be more difficult than using those features to create a classification model. Using too many features can lead to overfitting, hinders the interpretability and is computationally expensive.

#Related Works#
Here we study a few methods and compare them.

##Preprocessing##
Before doing any real processing, a few preprocessing tasks should be done to increase effectiveness. Mostly these tasks are Information Retrieval techniques which are also used in document classification as they both process text documents.

Stemming smooths the data set and tries to group similar words and words with the same root. For example imagine the effect of *voting* in the favor of political documents is decreased because in some *voting* is used more and in some *vote*. Although both have the same meaning they are considered different features.

As we'll see in the next section (Dimensionality Reduction), not all terms are worth processing and only a fraction of the whole term set will be chosen to represent the document. A high percentage of the terms in a document are very frequent but useless terms such as pronouns referred to as stop words. Now if we choose highly frequent terms for document representation, stop words will occupy the  features set resulting in a low accuracy.

##Dimensionality Reduction##
Normally in practical and sometimes in academic settings, the high number of terms in the documents causes problems, since some classification algorithms can't handle this. Also, this may cause the classifiers to tune to the training data resulting in a very good accuracy when reclassifying the data they have been trained on, and a much worse accuracy when classifying unseen data. It would take a long time and so on... . There are two approaches to deal with this issue, feature selection and feature extraction. We'll examine these approaches briefly.

###Features Selection###
Feature selection is the problem of selecting a relevant subset ($T ^{\prime}$) of terms ($T$) which yields the highest effectiveness and ignoring the rest. If done right, it will improve the predicting performance, improve the speed and reduce the cost of predictors and decrease the storage requirements. A few methods used to achieve this include: Ranking, filters, wrappers and embedded methods. Depending on the goal one could outperform the others, e.g. if the task is to predict as accurately as possible, an algorithm that has a safeguard  against overfitting might be better than ranking. A major factor in feature selection is the aggressivity $\cfrac{\vert T \vert}{\vert T ^{\prime} \vert}$ of reduction which affects the effectiveness directly. A high aggressivity results in the removal of some useful terms, therefore it should be chosen with care.

To give you a sense of how feature selection works, we'll look at filtering. Filtering works by keeping the terms that score high according to an importance measure. There are a variety of measures available, one of them is document frequency. Using document frequency as a measure, the feature selection algorithm chooses the terms that occur in the highest number of documents. Using this measure it is possible to reduce the dimensionality by a factor of 10 with no loss in effectiveness.

###Features Extraction###
Feature extraction generates a set of artificial terms to improve the effectiveness. The difference is that the mentioned term set is not a subset of all the terms, and the terms may not be available in the data at all, that's why sometimes this approach is called feature construction. There are a number of feature construction methods including clustering, basic linear transforms, latent semantic indexing and more sophisticated methods. Clustering method replaces a group of similar terms by a cluster centroid, which becomes a feature, mostly through K-means and hierarchical clustering algorithms.

##Supervised Classifiers##

###Naive Bayes###
As mentioned earlier Naive Bayes is a simple probabilistic classifier. The output $P(y \vert x)$ for $y$ being a category in the set of categories ($Y$) and  $x$ being a document in the set of documents ($X$), is the probability of $x$ belonging to class $y$. Simply said, Naive Bayes classifier learns from the number of occurrences of words (features) independently, meaning that it doesn't include the effect of presence of absence of one word on the other, thus decreasing the computations. This assumption also has it's drawbacks.

$$y_{1}, ..., y_{\left\vert{Y}\right\vert} \in Y$$
$$x_{1}, ..., x_{\left\vert{X}\right\vert} \in X$$

We'd compute the category for an unseen document $d$ with the word list $W$ using the following:

$$w_{1}, ..., w_{l_{d}} \in W$$
$$ y^{*}=argmax_{y_{j} \in Y} P(y_{j}) \prod_{i=1}^{l_{d}} P(w_{i} \vert y_{j})$$

Where $P(y_{j})$ is a priori probability of class $y_{j}$ and $P(w_{i} \vert y_{j})$ is the conditional probability of $w_{i}$ given class $y_{j}$. By applying the Laplace law of succession to estimate $P(w_{i} \vert y_{j})$, we'll get:

$$P(w_{i} \vert y_{j}) = \cfrac{n_{ij} + 1}{n_{j} + k{j}}$$

Where $n_{j}$ is the total number of words in class $y_{j}$, $n_{ij}$ is the number of occurrences of word $w_{i}$ in class $y_{j}$ and $k_{j}$ is the vocabulary size of class $c_{j}$.

###Decision Tree###
A decision tree classifier constructs a tree which its nodes represent features. In a binary document representation depending on whether the feature exists or not, a branch exiting from the node is taken. Starting from the root and checking features repeatedly, we'll get to a leaf which represents the corresponding category of the document. In contrast to the naive bayes method mentioned above which is quantitative and hard to interpret for humans, decision tree outputs are symbolic and easily interpreted. Several methods for decision tree classifiers include ID3, C4.5 and C5.

An algorithm for creating a decision tree might be as follows. For category $c_{i}$, at each step a term $t_{k}$ is chosen usually according to an entropy or information gave measure, then the term set is divided to groups by the selected term $t_{k}$ and each group is placed in a seperate subtree, then the procedure is repeated until each leaf of the tree contains training documents assigned to the same category $c_{i}$, which is then chosen as the label for the leaf.

Decision trees are prone to overfitting, since some of the branches may be too specific to the training data. To prevent the classifier from building huge trees with specific trees, setting parameters like maximum depth or minimum observation in a leaf, and pruning techniques are used.

##Semi-supervised Classifiers##

###Self-training###
This method is mostly used for replacing a supervised classifier by a semi-supervised one and enjoying the benefits of semi-supervised algorithm with less effort. First a classifier trained with the small labeled data available and it is used to classify the unlabeled data. At each step, the most promising documents are added to the training set and the classifier is re-trained with the new training set. In another words, the classifier uses its own predictions to train itself. Note that a mistake prediction's effect is doubled if it is chosen to be added to the training set. Some algorithms detect this using some threshold and try to avoid it by unlearning the those documents.

#Experimentsvaluation#
To experiment said methods and compare them with each other, using different parameters and in different situations, we need an evaluation method. To achieve this goal, while classifying the test set, a *confusion matrix* is created. Confusion matrices help show how the classifier performed on predicting and separating the classes. It is a n-n matrix with n being the number of classes. Rows represent the class assigned by experts and columns represent the class predicted by the classifier. Each document classification increments $m[i][j]$ where $i$ is index of the label given by experts, and $j$ is index of the label predicted.

<small style="text-align: center;">Table 1 - A confusion matrix in which rows are reference and cols are test.</small>
<table style="width: 600px; margin: 0 auto;" dir="ltr">
<tr><th></th><th>Siasi</th><th>Vrzsh</th><th>Eqts</th></tr>
<tr><th>Siasi</th><td>24</td><td>1</td><td>5</td></tr>
<tr><th>Vrzsh</th><td>2</td><td>21</td><td>3</td></tr>
<tr><th>Eqts</th><td>7</td><td>2</td><td>17</td></tr>
</table>


Given confusion matrix one can compute several useful metrics to evaluate the classifier. Of the most used metrics *precision*, *recall* and *accuracy* are explained here.

Accuracy is the overall correctness of the model. In the example it is calculated as follows:
$$accuracy = \cfrac{24 + 21 + 17}{24 + 1 + 5 + 2 + 21 + 3 + 7 + 2 + 17}$$
Precision is a measure of the accuracy provided that a specific class has been predicted. In the example, the precision of *Siasi* class is computed as follows:
$$precision(Siasi) = \cfrac{24}{24 + 2 + 7}$$
Recall, also referred to as sensivity, is a measure of the ability of a prediction model to select instances of a certain class from a data set. In the example the precision of *Siasi* is calculated as follows:
$$recall(Siasi) = \cfrac{24}{24 + 1 + 5}$$


#Experiments#

##Basic Implementation##
Using these metrics a few classifiers with a bag of words as document representation are implemented. The dimensionality reduction approach is feature selection, and the selected measure is frequency of terms in the whole corpora. With a little tweaking, a ratio of 10%  for $\cfrac{\vert T ^{\prime} \vert}{\vert T \vert}$ showed to have the best accuracy in the old hamshari corpora. Training with 2000 documents and testing with the next 500 document resulted in what follows.
<table style="width: 600px; margin: 0 auto;" dir="ltr">
<tr><th></th><th>Accuracy</th><th>Precision</th><th>Recall</th></tr>
<tr><th>Naive Bayes</th><td>0.564</td><td>0.510183</td><td>0.308059</td></tr>
<tr><th>Decision Tree</th><td>0.446</td><td>0.301544</td><td>0.244987</td></tr>
<tr><th>SVM(SVC)</th><td>0.33</td><td>0.52518</td><td>0.121543</td></tr>
<tr><th>SVM(LinearSVC)</th><td>0.594</td><td>0.417879</td><td>0.333864</td></tr>
</table>
It's worth saying that no parameter tuning has been done for any of the classifiers, especially no cutoff is set for decision tree and SVMs results are computed using the implementation's default parameters
<small style="text-align: center;">Table 2 - Results of the basic implementation for each classifier.</small>
<table style="width: 600px; margin: 0 auto;" dir="ltr"><tr><th></th><th>Accuracy</th><th>Precision</th<th>Recall</th></tr><tr><th>Naive Bayes</th><td>0.564</td><td>0.510183</td><td>0.308059</td></tr><tr><th>Decision Tree</th><td>0.446</td><td>0.301544</td><td>0.244987</td></tr><tr><th>SVM(SVC)</th><td>0.33</td><td>0.52518</td><td>0.121543</td></tr><tr><th>SVM(LinearSVC)</th><td>0.594</td><td>0.417879</td><td>0.333864</td></tr></table>


It's worth saying that no parameter tuning has been done for any of the classifiers, especially no cutoff is set for decision tree and SVMs results are computed using the implementation's default parameters.

##Optimization##
In the basic implementation, a mixture of nltk/scikit-learn was used with every part of the procedure from nltk toolkits and the classification algorithms which weren't available in nltk from scikit-learn. However, since scikit-learn uses Numpy ndarrays and Scipy sparse matrices for data processing mostly, it has a significant improvement in performance over nltk among higher functionality and ease of use. Also, scikit-learn has much more algorithms included with tools for preprocessing, evaluation and parameter tuning.

Different pipelines and configurations were tested and evaluated until the following configuration showed promising scores. Old hamshahri documents are feature extracted to represent tf-idf of terms in comparison with only representing  whether the document contains the term or not. To exclude stopwords terms with tf > 0.8 aren't added to feature set. As mentioned above feature selection plays an important role in effectiveness of classifiers, therefore 10% of features with the highest chi2 score are selected. Four classifiers used in the first implementation were used again so that comparison wouldn't be meaningless, only this time parameter tuning is done to SVC and LinearSVC classifiers, since their effectiveness varies greatly with different parameters for the same data model. To do this, an exhaustive grid search over suggested parameters is done and this task is repeated. Training with 2000 documents and testing with the next 500 document resulted in what follows.

<small style="text-align: center;">Table 3 - Results of the improved implementation for each classifier.</small>
<table style="width: 600px; margin: 0 auto;" dir="ltr"><tr><th></th><th>Accuracy</th><th>Precision</th><th>Recall</th><th>F1</th></tr><tr><th>Naive Bayes</th><td>0.604000</td><td>0.609943</td><td>0.604000</td><td>0.554812</td></tr><tr><th>Decision Tree</th><td> 0.526000</td><td> 0.542659</td><td>0.526000</td><td>0.527992</td></tr><tr><th>SVM(SVC)</th><td>0.640000</td><td>0.638713</td><td>0.640000</td><td>0.620524</td></tr><tr><th>SVM(LinearSVC)</th><td>0.678000</td><td>0.647120</td><td>0.678000</td><td>0.649800</td></tr></table>

Comparing the results would give us:

<img src="https://i.imgur.com/N6fxLCY.png"></img>

In which you see the tremendous growth in accuracy for SVC than other classifiers, which explains how parameter tuning affected the results. In the case of LinearSVC the default parameters seemed to be the best of them for this data model. Parameters found for SVC are:

    SVC(C=1000, cache_size=200, class_weight=None, coef0=0.0, degree=3,
    gamma=0.001, kernel=rbf, max_iter=-1, probability=False,
    random_state=None, shrinking=True, tol=0.001, verbose=False)

In a different experiment, in addition to term tf-idfs, ngrams of 2 were also added to the feature set, compare them:

<img src="https://i.imgur.com/29mtOLf.png"></img>

Different results are seen through classifiers, with decrease of Naive Bayes' accuracy, and increase of Decision Tree's accuracy. Probably decrease in Naive Bayes' accuracy is due to overfitting on the train data with the increase of feature set size.

[Source code](https://github.com/s1na/magpie)

#References#
* F Sebastiani, "Machine learning in automated text categorization", ACM computing surveys (CSUR), 2002.
* SL Ting, WH Ip and AHC Tsang, "Is Naïve Bayes a Good Classifier for Document Classification?", International Journal of Software Engineering and Its Applications Vol. 5, No. 3, July, 2011.
* X Zhu, "Semi-supervised learning literature survey", University of Wisconsin-Madison, 2006.
* J Turian, L Ratinov, Y Bengio, "Word representations: a simple and general method for semi-supervised learning", ACL '10 Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, 2010.
* I Guyon, A Elisseeff, "An introduction to variable and feature selection", The Journal of Machine Learning Research, Vol. 3, 2003.
* Y. H. Li, A. K. Jain, "Classification of Text Documents", The Computer Journal, 1998.
* R Kohavi, GH John, "Wrappers for feature subset selection", Vol. 97, Artificial Intelligence, 1997.
* Berry, Michael W., ed. Survey of Text Mining I: Clustering, Classification, and Retrieval. Vol. 1. Springer, 2004.

<hr />

#Useful Links#
+ [پردازش زبان فارسی در پایتون](http://www.sobhe.ir/hazm)
+ [پیکره فارسی همشهری](http://ece.ut.ac.ir/dbrg/hamshahri/fadownload.html)
+ [دسته‌بندی متون با استفاده از یادگیری ماشین، محسن رحیمی، پروژه هوش مصنوعی، سال ۹۱](http://bayanbox.ir/id/4963720756402634418?download)
+ [راهنمایی برای استخراج ویژگی از متن زبان طبیعی](http://pyevolve.sourceforge.net/wordpress/?p=1589)
+ [رده‌بندی متون با استفاده از کتابخانه Scikit-learn](http://scikit-learn.org/stable/auto_examples/document_classification_20newsgroups.html)
+ [Natual Language Processing Course - Text Classification](https://class.coursera.org/nlp/lecture/preview)