In text classification, the goal is to assign topical labels to the documents at hand. This is, in essence, a supervised problem: a collection of text documents that have already been grouped by topic is given to the system as training data, so that by learning from this collection it can assign new incoming documents to one of these topical groups.
In this research, various methods for classifying text documents are studied and implemented for the Persian language.
Abstract
TODO
Introduction
Data is being generated at a rapid rate, and with it grows the need for better web search, spam filtering, content recommendation, and so on, all of which fall within the scope of document classification. Document classification is the problem of assigning one or more pre-specified categories to documents.
A typical document classification process consists of training and building a classifier from a training set; the classifier is then given new documents to classify. Depending on the training set at hand, two learning approaches exist:
Supervised learning—This approach gained popularity, displacing Knowledge Extraction, and is the standard way to solve document classification problems in many domains. It requires a large labeled set of examples to build the classifier.
Many common classifiers are used, such as Naive Bayes, Decision Trees, SVMs, and KNNs. Naive Bayes has been shown to be a good classifier in terms of accuracy and computational efficiency, while also being simple.
Semi-supervised learning—A recent trend is to benefit from both labeled and unlabeled data. This approach, which sits between supervised and unsupervised learning, is exciting from both theoretical and practical points of view. Labeled data is difficult, expensive, and time consuming to obtain, whereas unlabeled data is cheap and easy to collect. In addition, results have shown that one can reach higher accuracy simply by adding unlabeled data to an existing supervised classification system.
Often-used methods in semi-supervised classification include EM with generative mixture models, self-training, co-training, transductive support vector machines, and graph-based methods; a minimal self-training sketch follows this list.
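To make the self-training idea concrete, here is a minimal sketch assuming scikit-learn is available; the toy documents, labels, and confidence threshold are all hypothetical. Following scikit-learn's convention, unlabeled examples are marked with -1.

    # A minimal self-training sketch (toy data is hypothetical).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.semi_supervised import SelfTrainingClassifier

    docs = [
        "cheap meds buy now",                 # labeled: spam
        "meeting agenda attached",            # labeled: ham
        "win a free prize now",               # unlabeled
        "please review the attached agenda",  # unlabeled
    ]
    labels = [1, 0, -1, -1]  # -1 marks unlabeled documents

    vec = CountVectorizer()
    X = vec.fit_transform(docs)  # bag-of-words count matrix

    # Self-training: iteratively pseudo-label unlabeled documents whose
    # predicted class probability exceeds the threshold, then retrain.
    clf = SelfTrainingClassifier(MultinomialNB(), threshold=0.8)
    clf.fit(X, labels)

    X_new = vec.transform(["free meds now", "agenda for the meeting"])
    print(clf.predict(X_new))

Any base classifier that exposes class probabilities could replace MultinomialNB here; Naive Bayes is used simply because it is cheap to retrain at every iteration.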
Hence one of the major decisions when trying to solve a document classification problem is choosing the right approach and method. For example, a poor match between problem structure and model assumptions can lead to lower accuracy in semi-supervised learning.
Another problem is the ever increasing number of features and variables in these methods. This raises the issue of finding good features, which can be more difficult than using those features to build a classification model. Using too many features can lead to overfitting, hinders interpretability, and is computationally expensive.
Related Works
Here we study a few methods and compare them. Note that some preprocessing steps are currently omitted to keep the document concise.
Feature Selection
Feature selection is the problem of selecting a relevant subset of features and ignoring the rest. Done right, it improves prediction performance, makes predictors faster and cheaper, and decreases storage requirements. Methods used to achieve this include ranking, filters, wrappers, and embedded methods. Depending on the goal, one can outperform the others; for example, if the task is to predict as accurately as possible, an algorithm with a safeguard against overfitting may beat plain ranking.
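As a small illustration of the ranking approach, the sketch below scores bag-of-words features with a chi-squared test and keeps only the top k. It assumes scikit-learn; the toy corpus and the choice of k are hypothetical.

    # A minimal ranking-style feature selection sketch (toy data is hypothetical).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, chi2

    docs = [
        "cheap meds buy now",
        "limited offer buy cheap",
        "meeting agenda attached",
        "please review the attached report",
    ]
    labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

    vec = CountVectorizer()
    X = vec.fit_transform(docs)  # documents as word-count vectors

    # Rank every word by its chi-squared statistic against the labels
    # and keep only the k highest-scoring ones.
    selector = SelectKBest(chi2, k=4)
    X_reduced = selector.fit_transform(X, labels)

    kept = vec.get_feature_names_out()[selector.get_support()]
    print(kept)  # the k words most associated with the class labels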
Naive Bayes
As mentioned earlier, Naive Bayes is a simple probabilistic classifier. Its output P(y|x), for y a category in the set of categories Y and x a document in the set of documents X, is the probability of x belonging to class y. Simply put, a Naive Bayes classifier learns from the number of occurrences of words (features) independently; that is, it ignores the effect of the presence or absence of one word on another, which reduces the computations. This assumption also has its drawbacks.
We compute the category y* for an unseen document d with word list W as follows:

    y^* = \arg\max_{y_j \in Y} P(y_j) \prod_{w_i \in W} P(w_i \mid y_j)

where P(yj) is the a priori probability of class yj and P(wi | yj) is the conditional probability of wi given class yj. Applying the Laplace law of succession to estimate P(wi | yj), we get:

    P(w_i \mid y_j) = \frac{n_{ij} + 1}{n_j + k_j}

where nj is the total number of words in class yj, nij is the number of occurrences of word wi in class yj, and kj is the vocabulary size of class yj.
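The following is a minimal from-scratch sketch of these two formulas in Python; the toy training corpus is hypothetical, and the product is computed in log space to avoid floating-point underflow, which leaves the arg max unchanged.

    import math
    from collections import Counter, defaultdict

    # Toy labeled corpus (hypothetical), as (word list, class) pairs.
    train = [
        (["cheap", "meds", "buy", "now"], "spam"),
        (["buy", "cheap", "offer"], "spam"),
        (["meeting", "agenda", "attached"], "ham"),
    ]

    docs_per_class = Counter(y for _, y in train)
    word_counts = defaultdict(Counter)  # n_ij: occurrences of w_i in class y_j
    total_words = Counter()             # n_j: total number of words in class y_j
    for words, y in train:
        word_counts[y].update(words)
        total_words[y] += len(words)

    def classify(words):
        best, best_score = None, float("-inf")
        for y in docs_per_class:
            k_j = len(word_counts[y])  # k_j: vocabulary size of class y_j
            score = math.log(docs_per_class[y] / len(train))  # log P(y_j)
            for w in words:
                # Laplace estimate: P(w_i | y_j) = (n_ij + 1) / (n_j + k_j)
                score += math.log((word_counts[y][w] + 1) / (total_words[y] + k_j))
            if score > best_score:
                best, best_score = y, score
        return best  # arg max over classes y_j

    print(classify(["cheap", "offer", "now"]))  # prints: spam

In practice one would use a library implementation such as scikit-learn's MultinomialNB, which implements the same multinomial model with configurable smoothing.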
Experiments
PLACEHOLDER
References
S. L. Ting, W. H. Ip, and A. H. C. Tsang, "Is Naïve Bayes a Good Classifier for Document Classification?", International Journal of Software Engineering and Its Applications, Vol. 5, No. 3, July 2011.
X. Zhu, "Semi-Supervised Learning Literature Survey", University of Wisconsin-Madison, 2006.
J. Turian, L. Ratinov, and Y. Bengio, "Word Representations: A Simple and General Method for Semi-Supervised Learning", Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL '10), 2010.
I. Guyon and A. Elisseeff, "An Introduction to Variable and Feature Selection", The Journal of Machine Learning Research, Vol. 3, 2003.
Y. H. Li and A. K. Jain, "Classification of Text Documents", The Computer Journal, 1998.
R. Kohavi and G. H. John, "Wrappers for Feature Subset Selection", Artificial Intelligence, Vol. 97, 1997.
M. W. Berry, ed., Survey of Text Mining I: Clustering, Classification, and Retrieval, Vol. 1, Springer, 2004.