Keyword Extraction from Text (Research)

Project changes from 1394/02/27 until now

In this project the implementation has not yet been done, so we survey various methods for solving the problem. The proposed methods are described in detail, and I have tried to explain them in a simple way.

**1. INTRODUCTION**
Keyword extraction (KE) is defined as the task that automatically identifies a set of terms that best describe the subject of a document [1, 2-4]. Keywords are meant to serve multiple goals. For example, (1) when they are printed on the first page of a journal article, the goal is summarization: they enable the reader to quickly determine whether the given article is in the reader’s fields of interest. (2) When they are printed in the cumulative index for a journal, the goal is indexing: they enable the reader to quickly find a relevant article when the reader has a specific need. (3) When a search engine form has a field labelled keywords, the goal is to enable the reader to make the search more precise.

Different terminology is used in studying the terms that represent the most relevant information contained in a document: key phrases, key segments, key terms or just keywords. All of these synonyms have the same function: they characterize the topics discussed in a document [5]. Extracting a small set of units, composed of one or more terms, from a single document is an important problem in Text Mining (TM), Information Retrieval (IR) and Natural Language Processing (NLP). Keywords are widely used to enable queries within IR systems, as they are easy to define, revise, remember, and share. In comparison to mathematical signatures, they are independent of any corpus and can be applied across multiple corpora and IR systems [1]. Keywords have also been applied to improve the functionality of IR systems. In other words, relevant extracted keywords can be used to build an automatic index for a document collection, or alternatively can be used for document representation in categorization or classification tasks [5, 6]. An extractive summary of a document is the core task of many IR and NLP applications, including automatic indexing, automatic summarization, document management, high-level semantic description, text, document or website categorization or clustering, cross-category retrieval, constructing domain-specific dictionaries, named entity recognition, topic detection and tracking, etc.

Since assigning keywords to documents manually is a costly, time-consuming and tedious task, and the number of digitally available documents keeps growing, automatic keyword extraction has attracted researchers’ interest in the last few years. Although keyword extraction applications usually work on single documents, keyword extraction is also used for more complex tasks (e.g., keyword extraction for a whole collection [7], an entire web site, or automatic web summarization [8]). With the appearance of big data, constructing an effective model for text representation becomes even more urgent and demanding. I would also like to present my experimental approach to automatic keyword and keyphrase extraction.
 **Methods**

Various methods of locating and defining keywords have been used, both individually and in concert. Despite their differences, most methods have the same purpose and attempt to do the same thing: using some heuristic (such as distance between words, frequency of word use, or predetermined word relationships), locate and define a set of words that accurately convey themes or describe information contained in the text.

**Systematization of methods**
Methods for selecting keywords can be roughly divided into two categories: (1) keyword assignment and (2) keyword extraction [6, 7, 11, 22]. Both revolve around the same problem – selecting the best keywords. In keyword assignment, keywords are chosen from a controlled vocabulary of terms or a predefined taxonomy, and documents are categorized into classes according to their content. Keyword extraction enriches a document with keywords that are explicitly mentioned in the text [18]. Words that occur in the document are analyzed in order to identify the most representative ones, usually by exploring source properties (i.e. frequency, length) [15]. Commonly, keyword extraction does not use a predefined thesaurus to determine the keywords. The scope of this work is limited to keyword extraction methods.

Existing methods for automatic keyword extraction can be divided, following Ping-I and Shi-Jen [19], into: 1) Statistics Approaches and 2) Machine Learning Approaches, or, in slightly more detail, into the four categories proposed by Zhang et al. [15]: 1) Simple Statistics Approaches, 2) Linguistics Approaches, 3) Machine Learning Approaches and 4) Other Approaches. Simple Statistics Approaches comprise simple methods which do not require training data. In addition, such methods are language- and domain-independent. Statistics of the words in a document can be used to identify keywords: n-gram statistics, word frequency, TF-IDF, word co-occurrences, PAT Tree (Patricia Tree; a suffix tree or position tree), etc. The disadvantage is that in some professional texts, such as health and medical documents, the most important keyword may appear only once in the article; the use of statistically empowered models may inadvertently filter out these words [19]. Linguistics Approaches mainly use linguistic features of the words, sentences and document. Lexical, syntactic, semantic and discourse analysis are among the most common, but complex, analyses. Machine Learning Approaches consider supervised or unsupervised learning from examples, though related work on keyword extraction favors the supervised approach. Supervised machine learning approaches induce a model which is trained on a set of keywords. They require manual annotation of the learning dataset, which is extremely tedious and inconsistent (and sometimes requires a predefined taxonomy). Unfortunately, authors usually assign keywords to their documents only when they are compelled to do so. The induced model is then applied for keyword extraction from a new document. This approach includes Naïve Bayes, SVM, C4.5, Bagging, etc. These methods require training data and are often dependent on the domain; the system needs to re-learn and re-establish the model every time the domain changes [20, 21]. Model induction can be very demanding and time-consuming on massive datasets.

Other Approaches for keyword extraction generally combine the methods mentioned above. Additionally, they sometimes incorporate heuristic knowledge into the fusion, such as the position, length and layout features of the terms, HTML and similar tags, text formatting, etc. The vector space model (VSM) is the best-known and most widely used model for text representation in text mining approaches [22, 30, 31]. Specifically, documents represented in the form of feature vectors are located in a multidimensional Euclidean space. This model is suitable for capturing simple word frequency; however, structural and semantic information is usually disregarded.
Hence, due to its simplicity, the VSM has several disadvantages [24]: 1) the meaning and structure of a text cannot be expressed, 2) each word is independent of the others, so word order and other relations cannot be captured, 3) if two documents have similar meaning but use different words, their similarity cannot be computed easily. Graph-based text representation is known as one of the best solutions that efficiently addresses these problems [24]. A graph is a mathematical model which enables very effective exploration of relationships and structural information. In short, a document is modeled as a graph where terms are represented by vertices and relations between terms are represented by edges.
An edge between two terms can be established according to many principles, exploiting different text scopes or relations for the graph construction [24, 59]:

1) words co-occurring together in a sentence, paragraph, section or document added to the graph as a clique; 

2) intersecting words from a sentence, paragraph, section or document;  

3) words co-occurring within the fixed window in text; 

4) semantic relations – connecting words that have similar meaning, words spelled the same way but have different meaning, synonyms, antonyms, heteronyms, etc.  

There are different possibilities for network analysis, and we will focus on the most common one: the network structure of the language elements themselves, at different levels: semantic and pragmatic, syntactic, morphological, phonetic and phonological. Generally, for these purposes we can study: (1) co-occurrence, (2) syntactic and (3) semantic networks [24, 53, 56].
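As a concrete illustration of principle (3) above, the following is a minimal sketch (my own, not taken from the cited works) of building an undirected co-occurrence network with a fixed sliding window. It assumes the `networkx` library, though any adjacency structure would do.

```python
import re

import networkx as nx  # assumed dependency; any adjacency structure would work


def cooccurrence_graph(text: str, window: int = 2) -> nx.Graph:
    """Link every pair of tokens that co-occur within `window` positions of each other."""
    tokens = re.findall(r"[a-z]+", text.lower())
    graph = nx.Graph()
    graph.add_nodes_from(tokens)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if word != other:
                # accumulate co-occurrence counts as edge weights
                current = graph.get_edge_data(word, other, default={}).get("weight", 0)
                graph.add_edge(word, other, weight=current + 1)
    return graph
```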
**RELATED WORK ON KEYWORD EXTRACTION**
Word Frequency Analysis

Much early work concerned the frequency of term usage in the text, but most of this work focused on defining keywords in relation to a single document. In 1972, the idea of statistically analyzing the frequency of keyword usage within a document in relation to multiple other documents became more common. [7]

This technique, known as Term Frequency - Inverse Document Frequency or simply TF-IDF, weights a given term to determine how well the term describes an individual document within a corpus. It does this by weighting the term positively for the number of times the term occurs within the specific document, while also weighting the term negatively relative to the number of documents which contain the term. Consider term t and document d ∈ D, where t appears in n of N documents in D. The TF-IDF function is of the form:

$$TFIDF(t,d,n,N) = TF(t,d) \times IDF(n,N) \qquad (1)$$

There are many possible TF and IDF functions. Practically, nearly any function could be used for the TF and IDF. Regularly-used functions include [9]:
$$TF(t,d)=\begin{cases}1 & \text{if } t \in d \\ 0 & \text{otherwise}\end{cases} \qquad (2)$$


$$TF(t,d)=\sum_{word \in d}\begin{cases}1 & \text{if } word = t \\ 0 & \text{otherwise}\end{cases} \qquad (3)$$


Additionally, the term frequency may be normalized to some range. This is then combined with the IDF function. Examples of possible IDF functions include:

$$IDF(n,N)=\log\left(\frac{N}{n}\right) \qquad (4)$$

$$IDF(n,N)=\log\left(\frac{N-n}{n}\right) \qquad (5)$$

Thus, a possible resulting TFIDF function could be:

$$TFIDF(t,d,n,N)=\left(\sum_{word \in d}\begin{cases}1 & \text{if } word = t \\ 0 & \text{otherwise}\end{cases}\right)\times \log\left(\frac{N}{n}\right) \qquad (6)$$

When the TF-IDF function is run against all terms in all documents in the document corpus, the words can be ranked by their scores. A higher TF-IDF score indicates that a word is both important to the document, as well as relatively uncommon across the document corpus. This is often interpreted to mean that the word is significant to the document, and could be used to accurately summarize the document [4].
TF-IDF provides a good heuristic for determining likely candidate keywords, and it (as well as various modifications of it) has been shown to be effective after several decades of research. Several different methods of keyword extraction have been developed since TF-IDF was first published in 1972, and many of these newer methods still rely on some of the same theoretic backing as TF-IDF. Due to its effectiveness and simplicity, it remains in common use today [8].
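To make the formulas above concrete, here is a minimal sketch (my own illustration, not a reference implementation) that combines the raw-count TF of Eq. (3) with the IDF of Eq. (4) to rank candidate keywords.

```python
import math
from collections import Counter


def tf(term: str, document: list[str]) -> int:
    """Eq. (3): number of times `term` occurs in the document."""
    return Counter(document)[term]


def idf(term: str, corpus: list[list[str]]) -> float:
    """Eq. (4): log(N / n), where n is the number of documents containing `term`."""
    n = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n) if n else 0.0


def tfidf(term: str, document: list[str], corpus: list[list[str]]) -> float:
    """Eq. (1)/(6): TF(t, d) * IDF(n, N)."""
    return tf(term, document) * idf(term, corpus)


if __name__ == "__main__":
    corpus = [
        "keyword extraction from a single document".split(),
        "graph based keyword extraction".split(),
        "lexical chains for text summarization".split(),
    ]
    doc = corpus[0]
    ranked = sorted(set(doc), key=lambda t: tfidf(t, doc, corpus), reverse=True)
    print(ranked[:3])  # highest-scoring candidate keywords for this document
```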
**Word Co-Occurrence Relationships**

While many methods of keyword extraction rely on word frequency (either within the document, within the corpus, or some combination of these), various possible problems have been pointed out with these metrics [12] [5], including reliance on a corpus, and the assumption that a good keyword will appear frequently within the document but not within other documents within the corpus. These methods also do not attempt to observe any sort of relationship between words in a document.  
**Using a Document Corpus**

One attempt at using this extra information utilizes a Markov Chain which is used to evaluate every word in the corpus of all documents [12]. This technique defines a Markov Chain for document d and term t with two states (C, T), where the probability of transitioning from C to T is the probability that the given term was observed in document d out of all documents (effectively the number of times that t occurs in d divided by the number of times t occurs in all documents), while the probability of moving from T to C is the probability that the term was observed out of all terms in d (the number of times t occurs in d divided by the number of term occurrences in d). Conceptually, if two terms arrive at the same state with similar regularity, they are related. The authors of this technique determined that a word is less likely to be descriptive of the document if it arrives at the same state with a similar frequency to many other words in the document (called the background distribution), while it is more likely to be descriptive of the document if it diverges the most from the background distribution. This technique was shown to match, and regularly beat, TF-IDF in terms of precision when run over a corpus of document abstracts from ACM [12].
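The two transition probabilities described above are easy to sketch. The snippet below is my own rough reading of that description, not the authors' code; it computes the two probabilities for a term and omits the comparison against the background distribution that the full technique in [12] performs.

```python
def transition_probabilities(term: str, document: list[str], corpus: list[list[str]]):
    """Return (P(C -> T), P(T -> C)) for the two-state chain described above."""
    count_in_doc = document.count(term)
    count_in_corpus = sum(doc.count(term) for doc in corpus)
    # C -> T: occurrences of t in d out of occurrences of t in all documents
    p_c_to_t = count_in_doc / count_in_corpus if count_in_corpus else 0.0
    # T -> C: occurrences of t in d out of all term occurrences in d
    p_t_to_c = count_in_doc / len(document) if document else 0.0
    return p_c_to_t, p_t_to_c
```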
**Frequency-Based Single Document Keyword Extraction**

Most methods of keyword extraction rely on using some method of comparing a document to a corpus to determine which words are most unique to an individual document. This measure becomes more difficult to use when the corpus is small, non-existent, or of a similar subject and composition.

One method developed by Matsuo and Ishizuka [5] to extract keywords from a single document uses word co-occurrence to build a co-occurrence matrix such as the one in Table 1. When using this method, two words are said to co-occur if they are both observed in a section of text delimited by a punctuation mark (effectively a sentence). In the given example, we can see that words a and d occur in the same sentence a total of 2 times in the document.


|       | a  | b  | c  | d  |
| ----- | -- | -- | -- | -- |
| **a** |    | 4  | 10 | 2  |
| **b** | 4  |    | 35 | 1  |
| **c** | 10 | 35 |    | 24 |
| **d** | 2  | 1  | 24 |    |

Table 1: Example co-occurrence matrix
The authors postulate that words are important to the document if they co-occur with other words more often in the document than they would if every instance of the word were randomly distributed. For some word $w_i$, this can be thought of as the ratio of the number of co-occurrences of words $w_i$ and $w_j$ to the number of all other co-occurrences involving $w_i$. Under the given assumptions, a high ratio would mean that the word $w_i$ is a likely keyword for the document.
One clear problem arises if a word only occurs once in the document: the ratio would not be based on enough information to be statistically significant. The ratio could also be unexpectedly high, since that word's row in the co-occurrence matrix would be extremely sparse. To combat this, the authors use Pearson’s chi-squared test (which yields a $\chi^2$ value) for each word in the document.

										
let $n$ = number of words (7)

let $O$ = observed frequency (8)

let $E$ = expected frequency (9)

$$\chi^2=\sum_{i=1}^{n}\frac{(O_i-E_i)^2}{E_i} \qquad (10)$$
     
This test allows the frequency distribution of each word to be tested and compared to an expected distribution. The authors expected a random distribution of words, and compared the observed distribution to the expected one.

A word which occurs a small number of times would have an occurrence distribution close to the expected (random) distribution and would have a low $\chi^2$ value, while a word that occurs frequently and regularly co-occurs with another word would have a high $\chi^2$ value.

The authors showed that this technique was able to closely match TF-IDF, but did not rely on the use of a document corpus.
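A simplified sketch of this scoring idea follows (my own reading; the published method additionally clusters frequent terms, which is omitted here): for each word, its observed sentence-level co-occurrence counts with the most frequent terms are compared against the counts expected under a random distribution.

```python
import re
from collections import Counter, defaultdict


def chi_squared_scores(text: str, num_frequent: int = 10) -> dict[str, float]:
    """Score each word by how much its co-occurrence with frequent terms
    deviates from what a random distribution over sentences would predict."""
    sentences = [re.findall(r"[a-z]+", s.lower()) for s in re.split(r"[.!?]", text)]
    sentences = [s for s in sentences if s]
    freq = Counter(w for s in sentences for w in s)
    frequent = [w for w, _ in freq.most_common(num_frequent)]

    # observed sentence-level co-occurrence counts with the frequent terms
    observed = defaultdict(Counter)
    for sent in sentences:
        unique = set(sent)
        for w in unique:
            for g in unique.intersection(frequent):
                if g != w:
                    observed[w][g] += 1

    scores = {}
    for w in freq:
        n_w = sum(1 for s in sentences if w in s)  # sentences containing w
        chi2 = 0.0
        for g in frequent:
            if g == w:
                continue
            p_g = sum(1 for s in sentences if g in s) / len(sentences)
            expected = n_w * p_g  # expected co-occurrences under randomness
            if expected > 0:
                chi2 += (observed[w][g] - expected) ** 2 / expected
        scores[w] = chi2
    return scores
```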
**Content-Sensitive Single Document Keyword Extraction**

Another method of keyword extraction, developed by Ohsawa et al. [7], attacks this problem from a different angle. While many methods of keyword extraction rely on statistical information gathered from term occurrence frequency in the document, this method, called KeyGraph, relies instead on clustering related items into groups to determine which words in a document are representative of the document’s content.

KeyGraph builds a graph representation of the document, with terms as nodes and edges representing frequently occurring co-occurrences within the document. Clusters of words are then identified by locating maximally-connected subgraphs within the document graph. Candidate keywords are then identified by locating nodes within the graph that have edges into two separate clusters. Intuitively, these candidate keywords are terms that join separate ideas or concepts (clusters), which the author(s) of the document presumably wrote with both concepts in mind. These candidate keywords are then ranked by the probability that, for each of the clusters they join, that word was the word used to join the two clusters (effectively, the most common word used to join these clusters).

Tests of KeyGraph show that it was able to match, and surpass, TF-IDF in a series of tests run by its creators [7]. Additionally, a series of tests run on social media data collected during the 2008 presidential election showed that KeyGraph was able to locate keywords in a noisy environment with large amounts of irrelevant information [10].
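The following is a much-simplified sketch of the bridging idea behind KeyGraph, not the original algorithm: it keeps only the strongest co-occurrence edges, treats the connected components of that core as stand-ins for KeyGraph's maximally-connected clusters, and ranks terms by how many distinct clusters their co-occurring partners fall into. It assumes the `networkx` library.

```python
import re
from collections import Counter
from itertools import combinations

import networkx as nx  # assumed dependency


def keygraph_candidates(text: str, top_edges: int = 30, top_k: int = 10) -> list[str]:
    """Rank terms by how many distinct clusters of the co-occurrence core they bridge."""
    sentences = [re.findall(r"[a-z]+", s.lower()) for s in re.split(r"[.!?]", text)]
    pair_counts = Counter()
    for sent in sentences:
        for a, b in combinations(sorted(set(sent)), 2):
            pair_counts[(a, b)] += 1

    # keep only the strongest co-occurrence edges as a sparse "cluster" core
    core = nx.Graph()
    core.add_edges_from(pair for pair, _ in pair_counts.most_common(top_edges))
    clusters = list(nx.connected_components(core))

    # a term scores higher the more distinct clusters its co-occurring partners fall into
    bridged: dict[str, set] = {}
    for a, b in pair_counts:
        for term, partner in ((a, b), (b, a)):
            for i, cluster in enumerate(clusters):
                if partner in cluster and term not in cluster:
                    bridged.setdefault(term, set()).add(i)
    return sorted(bridged, key=lambda t: len(bridged[t]), reverse=True)[:top_k]
```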

**Keyword Extraction Using Lexical Chains**

Lexical chains are simply a list of related words found in a text. The relationships are usually semantic, such as synonymy, hyponymy, or meronymy. One example of a lexical chain would be the following [6]:

Rome → capital → city → inhabitant                                                                             (11)

This representation of semantic information in natural languages allows for context to be encoded in the structure of the chain. In the given example, this can be seen by the word ”inhabitant” following the word ”city”. The word ”inhabitant” follows ”city” since ”inhabitant” is related to ”capital”. If not for this semantic information, the next word in the chain might have been something concerning graphs or data structures.

Lexical chains have regularly found use in automated text summarization techniques [1] [11] where they can be used to quickly and accurately locate terms and sequences of terms with similar meanings. Miller [6] gives the example: ”A sugar maple is a maple that...”. By following the example lexical chain, we can clearly see that ”sugar maple” and ”maple” have very similar meanings, and one of them may be a candidate for removal from the text for the purpose of creating a more concise summary of the phrase.

Sugar maple → maple → tree → plant
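A minimal sketch of how such a chain could be produced automatically (my own illustration, not the method of [2] or [6]) is to follow hypernym links in WordNet; the example below assumes NLTK with the WordNet corpus downloaded (`nltk.download("wordnet")`).

```python
from nltk.corpus import wordnet as wn  # assumes the WordNet corpus is installed


def hypernym_chain(word: str, depth: int = 4) -> list[str]:
    """Follow first-sense hypernym links upward to form a simple lexical chain."""
    synsets = wn.synsets(word, pos=wn.NOUN)
    if not synsets:
        return [word]
    chain, synset = [word], synsets[0]
    for _ in range(depth):
        hypernyms = synset.hypernyms()
        if not hypernyms:
            break
        synset = hypernyms[0]
        chain.append(synset.lemma_names()[0])
    return chain


if __name__ == "__main__":
    print(" -> ".join(hypernym_chain("maple")))
```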

It was proposed by Ercan and Cicekli that lexical chains could be used to locate words important to a text [2]. Their technique revolves around using a statistical classifier (C4.5) to build decision trees which can be used to determine whether a given word is a likely keyword. To do this, they assign each term in the text to the lexical chains which contain that term. Terms are then assigned scores based on the first and last locations within the document where the term is used, the average frequency at which the term appears, and the first and last locations within the document where synonyms, hypernyms/hyponyms, and meronyms occur.

By using C4.5 with these scores, the authors were able to locate keywords within the text that matched author-supplied keywords with up to 64% accuracy.

**Keyphrase Extraction Using Bayes Classifier**

An adaptation of TF-IDF was used in conjunction with a naive Bayes classifier by Frank et al. [3] to locate keyphrases in a document within a corpus. This method works by running the TF-IDF variation in equation (12) over every phrase in the document.

$$TFIDF(p,d)=\Pr[\text{phrase in } d \text{ is } p] \times -\log \Pr[p \text{ appears in any document}] \qquad (12)$$
where p is the phrase in question, and d is the current document. The probability that each phrase in the document is a keyphrase is then determined using Bayes’ theorem:

$$\Pr[key \mid T,D]=\frac{\Pr[T \mid key]\times\Pr[D \mid key]\times\Pr[key]}{\Pr[T,D]} \qquad (13)$$
where T is the TF-IDF value computed earlier, and D is the distance into the document of the first occurrence of the given phrase (the number of phrases that appear before it). Thus, Pr[T |key] is the probability that the phrase in question has the TF-IDF value T, Pr[D |key] is the probability that the phrase occurs at distance D into the document in question, and Pr[key] is the probability the phrase is a keyphrase, out of all phrases in the document. Pr[T,D] is used to normalize the resulting value to fall in the range [0,1].

The phrases are then ranked by the probabilities that they are keyphrases given T and D, and the k desired keyphrases are extracted from the top k phrases in the ranking.
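A hedged sketch of this scoring step follows (not the authors' implementation): Eq. (12) yields the TF-IDF-style feature for a candidate phrase, and Eq. (13) combines discretized feature probabilities, which in practice would be estimated from training data, into Pr[key | T, D]. The probability tables below are placeholders.

```python
import math


def phrase_tfidf(phrase: str, doc_phrases: list[str], corpus: list[list[str]]) -> float:
    """Eq. (12): Pr[phrase in d is p] * -log Pr[p appears in any document]."""
    if not doc_phrases or not corpus:
        return 0.0
    pr_in_doc = doc_phrases.count(phrase) / len(doc_phrases)
    pr_any_doc = sum(1 for doc in corpus if phrase in doc) / len(corpus)
    return pr_in_doc * -math.log(pr_any_doc) if pr_any_doc else 0.0


def keyphrase_probability(t_bin: str, d_bin: str,
                          pr_t_given_key: dict[str, float],
                          pr_d_given_key: dict[str, float],
                          pr_key: float, pr_t_d: float) -> float:
    """Eq. (13): combine discretized T (TF-IDF) and D (first-occurrence distance)
    features with the prior Pr[key], normalized by Pr[T, D].
    The probability tables are placeholders to be estimated from training data."""
    return pr_t_given_key[t_bin] * pr_d_given_key[d_bin] * pr_key / pr_t_d
```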

The authors showed that this method performed either comparably or slightly better than contemporary methods in a series of tests against a collection of websites and a set of medical journal articles [3].

**Conclusion**

TF-IDF is one of the best-known and most commonly used keyword extraction algorithms currently in use [8] when a document corpus is available. Several newer methods adapt TF-IDF for use as part of their process, and many others rely on the same fundamental concept as TF-IDF. Nearly all keyword extraction algorithms which make use of a document corpus depend on a weighted function which balances some measure of term or phrase appearance within a document (frequency, location within document, co-occurrence with other words) with some similar measure from the corpus.

When a corpus is unavailable, keyword extraction techniques must usually make use of measurements beyond those used by TF-IDF and related methods. Additional information sources include some form of lexical or semantic analysis, or some co-occurrence measure.
**Further discussion of approaches**
Although keyword extraction methods can be divided into (1) document-oriented and (2) collection-oriented, we are most interested in other systematizations in order to get a broad overview of the area. The approaches for keyword extraction can be roughly categorized as either (1) unsupervised or (2) supervised. Supervised approaches require an annotated data source, while unsupervised approaches require no annotations in advance. The massive use of social networks and Web 2.0 tools has spurred the development of new methods for keyword extraction; in order to improve performance on this massive data, some of the new methods are (3) semi-supervised. Figure 1 shows the different techniques that are combined into supervised, unsupervised or both approaches. Two critical issues of supervised approaches are the demand for training data with manually annotated keywords and the bias towards the domain on which they are trained. The following is a detailed overview of related work on keyword extraction methods.

**A. Supervised** 
The main idea of supervised methods is to transform keyword extraction into a binary classification task. Kea (Witten et al., 1999 [6]) and GenEx (Turney, 1999 [7]) are two typical and well-known systems [6, 7] which established the whole research field of keyword extraction. The task is to classify words from the text into keyword candidates: a word is either a keyword or not. The most important features for classifying a keyword candidate in these systems are the frequency and location of the term in the document. In short, GenEx uses Quinlan’s C4.5 decision tree induction algorithm for its learning task, while Kea uses the Naïve Bayes machine learning algorithm for training and keyphrase extraction. GenEx and Kea are extremely important systems because they set up the foundation for all other keyword extraction methods developed later, and they have become the state-of-the-art benchmark for evaluating the performance of other methods. Hulth (2003) in [8] explores the incorporation of linguistic knowledge into the extraction procedure and uses noun phrase (NP) chunks (rather than term frequency and n-grams), adding the POS tag(s) assigned to the term as a feature. In more detail, extracting NP chunks gives better precision than n-grams, and adding the POS tag(s) assigned to the term as a feature improves the results independently of the term selection approach applied. Turney (2003) in [9] implements enhancements to the Kea keyphrase extraction algorithm by using statistical associations between keyphrases, which improves the coherence of the extracted keywords. Song et al. (2003) present an information gain-based keyphrase extraction system called KPSpotter [10]. HaCohen-Kerner et al. (2005) in [14] investigate automatic extraction and learning of keyphrases from scientific articles written in English; they use different machine learning methods and report that the best results are achieved with J48 (an improved variant of C4.5). Medelyan and Witten (2006) propose a new method called KEA++, which enhances automatic keyphrase extraction by using semantic information on terms and phrases gleaned from a domain-specific thesaurus [11]. KEA++ is actually an improved version of the previously mentioned Kea devised by Witten et al. Zhang Y. et al. (2006) in [13] propose the use of not only “global context information” but also “local context information”. For the task of keyword extraction they employ Support Vector Machines (SVM); experimental results indicate that the proposed SVM-based method can significantly outperform baseline methods for keyword extraction. Wang (2006) in [17] uses the following features, in combination with neural networks, to determine whether a phrase is a keyphrase: TF and IDF, whether it appears in the title or headings (subheadings) of the given document, and the frequency with which it appears in the paragraphs of the given document. Nguyen and Kan (2007) [12] propose an algorithm for keyword extraction from scientific publications using linguistic knowledge. They introduce features that capture salient morphological phenomena found in scientific keyphrases, such as whether a candidate keyphrase is an acronym or whether it uses specific terminologically productive suffixes. Zhang C. et al. (2008) in [15] implement a keyword extraction method for documents using Conditional Random Fields (CRF).
The CRF model is a state-of-the-art sequence labeling method which can exploit document features more thoroughly and efficiently, and it treats keyword extraction as a string labeling task. The CRF model outperforms other ML methods such as SVM, the Multiple Linear Regression model, etc. Krapivin et al. (2010) in [16] use NLP techniques to improve different machine learning approaches (SVM, Local SVM, Random Forests) to the problem of automatic keyphrase extraction from scientific papers. Evaluation shows promising results that outperform the state-of-the-art Bayesian learning system KEA on the same dataset without the use of controlled vocabularies.
**B. Unsupervised**
HaCohen-Kerner (2003) in [26] presents a simple model that extracts keywords from abstracts and titles. The model uses unigrams, 2-grams and 3-grams, and a stop-words list; the highest weighted group of words (merged and sorted n-grams) is proposed as keywords. Pasquier (2010) in [27] describes the design of a keyphrase extraction algorithm for a single document using sentence clustering and Latent Dirichlet Allocation. The principle of the algorithm is to cluster sentences of the document in order to highlight parts of text that are semantically related. The clustering is performed using the cosine similarity between sentence vectors, with K-means, Markov Cluster Process (MCP) and ClassDens techniques. The clusters of sentences, which reflect the themes of the document, are analyzed to obtain the main topics of the text, and the most important words from these topics are proposed as keyphrases. Pudota et al. (2010) in [28] design a domain-independent keyphrase extraction system that can extract potential phrases from a single document in an unsupervised way. They employ n-grams, but also incorporate linguistic knowledge (POS tags) and statistics (frequency, position, lifespan) of each n-gram in defining candidate phrases and their respective feature sets. Recent research by Yang et al. (2013) in [29] focuses on keyword extraction based on the entropy difference between the intrinsic and extrinsic modes, which reflects the fact that relevant words significantly express the author’s writing intention. Their method uses the Shannon entropy difference between the intrinsic and extrinsic modes: occurrences of relevant words are modulated by the author’s purpose, while irrelevant words are distributed randomly in the text. They indicate that the ideas of this work can be applied to any natural language with clearly identified words, without requiring any previous knowledge about semantics or syntax.
**C. Graph-Based**

Ohsawa et al. (1998) in [18] propose an algorithm for automatic indexing by co-occurrence graphs constructed from metaphors, called KeyGraph. This algorithm is based on segmenting a graph, representing the co-occurrence between terms in a document, into clusters. Each cluster corresponds to a concept on which the author’s idea is based, and the top-ranked terms, scored by a statistic based on each term’s relationship to these clusters, are selected as keywords. KeyGraph proved to be a content-sensitive, domain-independent indexing device. Lahiri et al. (2014) in [2] extract keywords and keyphrases from co-occurrence networks of words and from noun-phrase collocation networks. Eleven measures (degree, strength, neighborhood size, coreness, clustering coefficient, structural diversity index, PageRank, HITS hub and authority score, betweenness, closeness and eigenvector centrality) are used for keyword extraction from directed/undirected and weighted networks. The results obtained on four data sets suggest that centrality measures outperform the baseline term frequency - inverse document frequency (TF-IDF) model, and that simpler measures like degree and strength outperform computationally more expensive centrality measures like coreness and betweenness. Boudin (2013) in [3] compares various centrality measures for graph-based keyphrase extraction. Experiments on standard data sets of English and French show that simple degree centrality achieves results comparable to the widely used TextRank algorithm, and that closeness centrality obtains the best results on short documents. Undirected and weighted co-occurrence networks are constructed from syntactically (only nouns and adjectives) parsed and lemmatized text using a co-occurrence window. Degree, closeness, betweenness and eigenvector centrality are compared to PageRank as proposed by Mihalcea (2004) in [4] as a baseline. Degree centrality achieves performance similar to the more complex TextRank, while closeness centrality outperforms TextRank on short documents (scientific paper abstracts). Litvak and Last (2008) in [6] compare supervised and unsupervised approaches for keyword identification in the task of extractive summarization. The approaches are based on a graph-based syntactic representation of text and web documents. The results of the HITS algorithm on a set of summarized documents performed comparably to supervised methods (Naïve Bayes, J48, SVM). The authors suggest that simple degree-based rankings from the first iteration of HITS, rather than running it to convergence, should be considered. Grineva et al. (2009) in [36] use community detection techniques for key term extraction on Wikipedia texts modelled as graphs of semantic relationships between terms. The results showed that terms related to the main topics of the document tend to form a community, i.e. a thematically cohesive group of terms. Community detection allows effective processing of multiple topics in a document and efficiently filters out noise. The results were achieved on weighted and directed networks built from semantically linked, morphologically expanded and disambiguated n-grams from the articles’ titles. Additionally, to test noise stability, they repeated the experiment on different multi-topic web pages (news, blogs, forums, social networks, product reviews), which confirmed that community detection outperforms the TF-IDF model. Palshikar (2007) in [5] proposes a hybrid structural and statistical approach to extract keywords from a single document.
The undirected co-occurrence network, using a dissimilarity measure between two words (calculated from the frequency of their co-occurrence in the preprocessed and lemmatized document) as the edge weight, was shown to be appropriate for the centrality-based approach to keyword extraction.

Mihalcea and Tarau (2004) in [4] report a seminal piece of research which introduced the state-of-the-art TextRank model. TextRank is derived from PageRank and introduced graph-based ranking into text processing, keyword and sentence extraction. The abstracts are modelled as undirected or directed and weighted co-occurrence networks using a co-occurrence window of variable size (2-10). Lexical units are preprocessed: stop-words are removed and words are restricted with POS syntactic filters (open class words; nouns and adjectives; nouns). The PageRank-motivated score of the importance of a node, derived from the importance of the neighboring nodes, is used for keyword extraction. The obtained TextRank performance compares favorably with the supervised machine learning n-gram based approach. Matsuo et al. in [36] present early research in which a text document is represented as an undirected and unweighted co-occurrence network. Based on the network topology, the authors proposed an indexing system called KeyWorld, which extracts important terms (pairs of words) by measuring their contribution to small-world properties. The contribution of a node is based on closeness centrality, calculated as the difference in small-world properties of the network under temporary elimination of that node, combined with inverse document frequency (IDF). Erkan and Radev [33] introduce a stochastic graph-based method for computing the relative importance of textual units for the problem of text summarization by extracting the most important sentences. LexRank calculates sentence importance based on the concept of eigenvector centrality in a graph representation of sentences. A connectivity matrix based on intra-sentence cosine similarity is used as the adjacency matrix of the graph representation of sentences. LexRank is shown to be quite insensitive to noise in the data. Mihalcea (2004) in [31] presents an extension of the earlier work [4] in which the TextRank algorithm is applied to the text summarization task, powered by sentence extraction. On this task TextRank performed on a par with the supervised and unsupervised summarization methods, which motivated a new branch of research based on graph-based extraction and ranking algorithms. Tsatsaronis et al. (2010) in [44] present SemanticRank, a network-based ranking algorithm for keyword and sentence extraction from text. Semantic relations are based on a calculated knowledge-based measure of semantic relatedness between linguistic units (keywords or sentences). The keyword extraction results on the Inspec abstracts reported a favorable performance of SemanticRank over state-of-the-art counterparts: weighted and unweighted variations of PageRank and HITS. Huang et al. [42] propose an automatic keyphrase extraction algorithm using an unsupervised method based on connectedness and betweenness centrality. Litvak et al. (2011) in [6] introduce DegExt, a graph-based, language-independent keyphrase extractor which extends the keyword extraction method described in [6]. They also compare DegExt with state-of-the-art approaches: GenEx [11] and TextRank [4]. DegExt surpasses both in terms of precision, implementation simplicity and computational complexity. Abilhoa and de Castro (2014) in [21] propose a keyword extraction method representing tweets (microblogs) as graphs and applying centrality measures for finding the relevant keywords. They develop a technique named Twitter Keyword Graph in which the pre-processing step uses tokenization, stemming and stop-word removal.
Keywords are extracted from the graph by applying graph centrality measures in cascade: closeness and eccentricity. The performance of the algorithm is tested on a single text from the literature and compared with the TF-IDF approach and the KEA algorithm; finally, the algorithm is tested on five sets of tweets of increasing size. With respect to computational time, the algorithm proved to be a robust proposal for extracting keywords from texts, especially from short texts like microblogs. Zhou et al. (2013) in [22] investigate weighted complex network based keyword extraction, incorporating exploration of the network structure and linguistic knowledge. The focus is on the construction of a lexical network, including a reasonable selection of nodes, a proper description of the relationships between words, a simple weighted network and TF-IDF. A reasonable selection of words from texts as lexical nodes from a linguistic perspective, a proper description of the relationships between words, and the enhancement of node attributes attempt to represent texts as lexical networks more accurately. The Jaccard coefficient is used to reflect the associations or relationships of two words, rather than the usual co-occurrence criteria, in the process of network construction. The importance of each node as a keyword candidate is calculated with closeness centrality, and a compound measure that takes node attributes (word length and IDF) into account is used. The approach is compared with three competitive baseline approaches: a binary network, a simple weighted network and the TF-IDF approach. Experiments for Chinese indicate that the lexical network constructed by this approach achieves better accuracy, recall and F-value than the classic TF-IDF method. Wan and Xiao (2008) in [23] propose using a small number of nearest-neighbor documents to provide more knowledge to improve single-document keyphrase extraction. A specified document is expanded to a small document set by adding a few neighbor documents close to the document using the cosine similarity measure, while term weights are computed by TF-IDF. Local information in the specified document and global information in all the neighbor documents are taken into consideration over the expanded document set with a graph-based ranking algorithm. Xie (2005) in [46] studies different centrality measures in order to predict noun phrases that appear in the abstracts of scientific articles. The tested measures are: degree, closeness, betweenness and information centrality. The results show that centrality measures improve the accuracy of the prediction in terms of both precision and recall. Furthermore, the method of constructing the noun-phrase (NP) network significantly influences the accuracy when the centrality heuristic is used by itself, but its influence is negligible when it is used together with other text features in decision trees.
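The graph-based ranking idea shared by the TextRank-style methods above can be sketched briefly. The snippet below is my own illustration, not any of the published implementations: it builds a sliding-window co-occurrence graph and ranks nodes with PageRank, assuming the `networkx` library.

```python
import re

import networkx as nx  # assumed dependency


def textrank_style_keywords(text: str, window: int = 2, top_k: int = 10) -> list[str]:
    """Rank words by PageRank over a sliding-window co-occurrence graph."""
    tokens = re.findall(r"[a-z]+", text.lower())
    graph = nx.Graph()
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if word != other:
                graph.add_edge(word, other)
    scores = nx.pagerank(graph)  # PageRank-motivated node importance
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```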


----------
[1] M. W. Berry, J. Kogan, Text Mining: Applications and Theory, Wiley, UK, 2010. 
[2] S. Lahiri, S. R. Choudhury, C. Caragea, “Keyword and Keyphrase Extraction Using Centrality Measures on Collocation Networks”, arXiv preprint arXiv:1401.6571, 2014. 
[3] F. Boudin, “A comparison of centrality measures for graph-based keyphrase extraction”, in Int. Joint Conf. on Natural Language Processing (IJCNLP), pp. 834-838, 2013.
[4] R. Mihalcea, P. Tarau, “TextRank: Bringing order into texts”, in ACL Empirical Methods in Natural Language Processing- EMNLP04, pp. 104-411, 2004. 
[5] G. K. Palshikar, “Keyword Extraction from a Single Document Using Centrality Measures” in 2nd Int. Conf. PReMI 2007, LNCS 4815, pp. 503-510, 2007. 
[6] M. Litvak, M. Last, H. Aizenman, I. Gobits, A. Kandel, “DegExt – A Language-Independent Graph-Based keyphrase Extractor” in Proc. of the 7th AWIC 2011, pp. 121-130, Switzerland, 2011. 
[7] J-L. Wu, A. M. Agogino, “Automating Keyphrase Extraction with Multi-Objective Genetic Algorithms”, in Proc. of the 37th HICSS, pp. 104-111, 2003.

[8] Y. Zhang, E. Milios, N. Zincir-Heywood, “A Comparison of Keyword- and Keyterm-based Methods for Automatic Web Site Summarization” in Tech. Report: Papers for the on Adaptive Text Extraction and Mining, pp. 15-20, San Jose, 2014.  
[9] I. H. Witten, G. W. Paynter, E. Frank, C. Gutwin, C. G. Nevill-Manning, “Kea: Practical Automatic Keyphrase Extraction” in Proc. of the 4th ACM Conf. of the Digital Libraries, Berkeley, CA, USA, 1999.
[10] A. Hulth, “Improved Automatic Keyword Extraction Given More Linguistic Knowledge” in Proc. of EMNLP 2003, pp. 216-223, Stroudsburg, USA, 2003. 
[11] P. D. Turney, “Coherent Keyphrase Extraction via Web Mining” in Proc. of IJCAI 2003, pp. 434-439, San Francisco, USA, 2003. 
[10] M. Song, I.-Y. Song, X. Hu, “KPSpotter: a flexible information gain-based keyphrase extraction system” in Proc. of 5th Int. Workshop of WIDM 2003, pp. 50-53, 2003. 
[11] O. Medelyan, I. H. Witten, Thesaurus Based Automatic Keyphrase Indexing, in Proc. of the 6th ACM/IEEE-CS JCDL 2006, pp. 296- 297, New York, USA, 2006. 
[13] K. Zhang, H. Xu, J. Tang, J. Li, “Keyword Extraction Using Sup- port Vector Machine” in Proc. of 7th Int. Conf. WAIM 2006, pp. 85-96, Hong Kong, China, 2006.  
[12] Y. HaCohen-Kerner, Z. Gross, A. Masa, “Automatic Extraction and Learning of Keyphrases from Scientific Articles” in Proc. of 6th Int. Conf. CICLing 2005, pp. 657-669, Mexico City, Mexico, 2005.
[13] C. Zhang, H. Wang, Y. Liu, D. Wu, Y. Liao, B. Wang, “Automatic Keyword Extraction from Documents Using Conditional Random Fields” in Journal of CIS 4:3(2008), pp. 1169-1180, 2008.
 [16] M. Krapivin, A. Autayeu, M. Marchese, E. Blanzieri, N. Segata, “Keyphrases Extraction from Scientific Documents: Improving Machine Learning Approaches with Natural Language Processing” in Proc. of 12th Int. Conf. on Asia-Pacific Digital Libraries, ICADL 2010, Gold Coast, Australia, LNAI v.6102, pp. 102-111, 2010. 
[14] J. Mijić, B. Dalbelo Bašić, J. Šnajder, “Robust Keyphrase Extraction for a Largescale Croatian News Production System” in Proc. of EMNLP, pp. 59-99, 2010.
 [15] P. Chen, S. Lin,   "Automatic keyword prediction using Google similarity distance", presented at Expert Syst. Appl.,  pp. 1928- 1938, 2010.
[16] K. S. Jones, “Information retrieval and artificial intelligence”, Artificial Intelligence, 114(1-2), 257-281, 1999.
 [17] J.-Y. Chang, I.-M. Kim, “Analysis and Evaluation of Current Graph-Based Text Mining Researches” in Advanced Science and Technology Letters, v.42, pp. 100-103, 2013.
[18] Y. Ohsawa, N. E. Benson, M. Yachida, “KeyGraph: Automatic Indexing by Co-Occurrence Graph Based on Building Construction Metaphor” in Proc. ADL 1998, pp. 12-18, 1998. 
[19] Y. HaCohen-Kerner, “Automatic Extraction of Keywords from Abstracts” in Proc. of 7th Int. Conf. KES 2003 (LNCS v. 2773), pp, 843-849, 2003.
 [20] C. Pasquier, “Single Document Keyphrase Extraction Using Sentence Clustering and Latent Dirichlet Allocation” in Proc. of the 5th Int. Workshop on Semantic Evaluation (ACL 2010), pp. 154-157, 2010. 
[21] W. D. Abilhoa, L. N. de Castro, “A keyword extraction method from twitter messages represented as graphs” Applied Mathematics and Computation v. 240, pp. 308-325, 2014. 
[22] Z. Zhou, X. Zou, X. Lv, J. Hu, “Research on Weighted Complex Network Based Keywords Extraction” in Lecture Notes in Computer Science Volume 8229, 2013, pp. 442-452, 2013.
 [23] X. Wan, J. Xiao, “Single Document Keyphrase Extraction Using Neighborhood Knowledge” in Proc.of the 23rd AAAI Conference on Artificial Intelligence, pp. 855-860, 2008.
 [24] J. Mijić, B. Dalbelo-Bašić, J. Šnajder “Robust keyphrase extraction for a large-scale Croatian news production system” FASSBL 2010, pp. 59-66, 2010.
[25] J. Saratlija, J. Šnajder, B. Dalbelo-Bašić, “Unsupervised topic-oriented keyphrase extraction and its application to Croatian”, Text, Speech and Dialogue, pp. 340-347, 2011.
[26] A. Masucci, G. Rodgers, “Differences between normal and shuffled texts: structural properties of weighted networks”, Advances in Complex Systems, 12(01):113-129, 2009.
 [27] M. E. J. Newman, Networks: An Introduction, Oxford University Press, 2010. 
[28] S. Beliga, A. Meštrović, S. Martinčić-Ipšić, “Toward Selectivity Based Keyword Extraction for Croatian News”, submitted to the Workshop on Surfacing the Deep and the Social Web, co-organized by ICT COST Action KEYSTONE (IC1302), Riva del Garda, Trento, Italy, 2014.
[29] R.V. Sole, B. C. Murtrta, S. Valverde, L. Steels, “Language Networks: their structure, function and evolution”, Trends in Cognitive Sciences, 2005.
 [30] J. Borge-Holthoefer, A. Arenas, “Semantic networks: Structure and dynamics”, Entropy 2010, 12(5), pp. 1264-1302, 2010. 
 [31] R. Mihalcea, D. Radev, Graph-based Natural Language Processing and Information Retrieval, Cambridge University Press, 2011. 
[32] R. Barzilay, M. Elhadad, et al. Using lexical chains for text summarization. In Proceedings of the ACL workshop on intelligent scalable text summarization, volume 17, pages 10–17, 1997.
[33] G. Ercan and I. Cicekli. Using lexical chains for keyword extraction. Information Processing & Management, 43(6):1705–1714, 2007.
[34] Eibe Frank, Gordon W. Paynter, Ian H. Witten, Carl Gutwin, et al. Domain-specific keyphrase extraction. In Proc. Sixteenth International Joint Conference on Artificial Intelligence, pages 668–673. Morgan Kaufmann Publishers, 1999.
[35] K.S. Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1):11–21, 1972.
[36] Y. Matsuo and M. Ishizuka. Keyword extraction from a single document using word co-occurrence statistical information. International Journal on Artificial Intelligence Tools, 13:2004, 2004.
[37] G.A. Miller et al. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41, 1995.
[38] Yukio Ohsawa, Nels E. Benson, and Masahiko Yachida. KeyGraph: Automatic indexing by co-occurrence graph based on building construction metaphor. In Proceedings of the Advances in Digital Libraries Conference, ADL ’98, pages 12–18, Washington, DC, USA, 1998. IEEE Computer Society.
[39] S. Robertson. Understanding inverse document frequency: on theoretical arguments for IDF. Journal of Documentation, 60(5):503–520, 2004.
[40] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information processing & management, 24(5):513–523, 1988.
[41] H. Sayyadi, M. Hurst, and A. Maykov. Event detection and tracking in social streams. In Proceedings of International Conference on Weblogs and Social Media (ICWSM), 2009.
[42] H.G. Silber and K.F. McCoy. Efficiently computed lexical chains as an intermediate representation for automatic text summarization. Computational Linguistics, 28(4):487–496, 2002.
[43] Christian Wartena, Rogier Brussee, and Wout Slakhorst. Keyword extraction using word co-occurrence. In Proceedings of the 2010 Workshops on Database and Expert Systems Applications, DEXA ’10, pages 54–58, Washington, DC, USA, 2010. IEEE Computer Society.
[44] G. Tsatsaronis, I. Varlamis, K. Nørvåg, “SemanticRank: ranking keywords and sentences using semantic graphs” in ACL 23rd Int. Conf. on Computational Linguistics, pp. 1074-1082, 2010.
[45] M. Grineva, M. Grinev, D. Lizorkin, “Extracting Key Terms From Noisy and Multi-theme Documents” in Proc. of the 18th Int. Conf. on World Wide Web, pp. 661-670, NY, USA, 2009. 
[46] Z. Xie, “Centrality Measures in Text Mining: Prediction of Noun Phrases that Appear in Abstracts” in Proc. of 43rd Annual Meeting of the Association for Computational Linguistics, ACL, University of  Michigan, USA, 2005.