1. Introduction
Reference is a relation between objects in which one object designates, or acts as a means by which to connect to or link to, another object. The first object in this relation is said to refer to the second object. The second object – the one to which the first object refers – is called the referent of the first object.
References can take many forms, including: a thought; a sensory perception that is audible (onomatopoeia), visual (text), olfactory, or tactile; an emotional state; a relationship with another spacetime coordinate; a symbolic or alphanumeric designation; a physical object; or an energy projection. Other concrete and abstract contexts also exist as methods of defining references within the scope of the various fields that require an origin, point of departure, or an original form. This includes methods that intentionally hide the reference from some observers, as in cryptography.
Our goal in this project is to find references that denote the same entity and merge them into a single reference.
2. Related Works
2.1. Document clustering
Document clustering (or text clustering) is the automatic organization of documents, used for topic extraction and for fast information retrieval or filtering. It is closely related to data clustering.
A web search engine often returns thousands of pages in response to a broad query, making it difficult for users to browse or to identify relevant information. Clustering methods can be used to automatically group the retrieved documents into a list of meaningful categories, as is achieved by Enterprise Search engines such as Northern Light and Vivisimo, consumer search engines such as PolyMeta and Helioid, or open source software such as Carrot2.
Examples:
Clustering divides the results of a search for "cell" into groups like "biology," "battery," and "prison."
FirstGov.gov, the official Web portal for the U.S. government, uses document clustering to automatically organize its search results into categories. For example, if a user submits “immigration”, next to their list of results they will see categories for “Immigration Reform”, “Citizenship and Immigration Services”, “Employment”, “Department of Homeland Security”, and more.
Probabilistic Latent Semantic Analysis (PLSA) can also be used to perform document clustering.
Document clustering involves the use of descriptors and descriptor extraction. Descriptors are sets of words that describe the contents within the cluster. Document clustering is generally considered to be a centralized process. Examples of document clustering include web document clustering for search users.
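As a minimal sketch of descriptor extraction with scikit-learn (the toy corpus and cluster count are illustrative assumptions, not from the text above), the highest-weighted terms of each cluster centroid can serve as the cluster's descriptors:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["battery cell voltage", "prison cell block", "blood cell biology", "cell battery charger"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

km = KMeans(n_clusters=2, random_state=0).fit(X)
terms = vectorizer.get_feature_names_out()
for i, center in enumerate(km.cluster_centers_):
    top = center.argsort()[::-1][:3]          # three highest-weighted terms of this centroid
    print("cluster", i, "descriptors:", [terms[t] for t in top])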
The applications of document clustering can be categorized into two types, online and offline. Online applications are usually constrained by efficiency problems when compared to offline applications.
In general, there are two common classes of algorithms. The first is the hierarchical approach, which includes single link, complete linkage, group average and Ward's method. By aggregating or dividing, documents can be clustered into a hierarchical structure, which is suitable for browsing. However, such algorithms usually suffer from efficiency problems. The other class is based on the K-means algorithm and its variants. It is usually more efficient, but less accurate, than the hierarchical approach. These algorithms can further be classified as hard or soft clustering algorithms. Hard clustering computes a hard assignment: each document is a member of exactly one cluster. The assignment of soft clustering algorithms is soft: a document's assignment is a distribution over all clusters, so a document has fractional membership in several clusters. Latent semantic indexing, a form of dimensionality reduction, is a soft clustering algorithm.
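To make the hard/soft distinction concrete, the following sketch contrasts a hard k-means assignment with a soft mixture-model assignment on a placeholder feature matrix (the data and parameter values are illustrative assumptions; the Gaussian mixture stands in here for any soft method such as latent semantic indexing):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X = np.random.rand(100, 5)           # 100 documents, 5 features (placeholder data)

hard = KMeans(n_clusters=3, random_state=0).fit_predict(X)
print(hard[:5])                       # one cluster id per document (hard assignment)

soft = GaussianMixture(n_components=3, random_state=0).fit(X)
print(soft.predict_proba(X)[:5])      # a distribution over the 3 clusters per document (soft assignment)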
Other algorithms involve graph-based clustering, ontology-supported clustering and order-sensitive clustering.
K-means
k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.
The problem is computationally difficult (NP-hard); however, there are efficient heuristic algorithms that are commonly employed and converge quickly to a local optimum. These are usually similar to the expectation-maximization algorithm for mixtures of Gaussian distributions via an iterative refinement approach employed by both algorithms. Additionally, they both use cluster centers to model the data; however, k-means clustering tends to find clusters of comparable spatial extent, while the expectation-maximization mechanism allows clusters to have different shapes.
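As a minimal illustration of this iterative refinement, the following NumPy sketch of Lloyd's algorithm alternates an assignment step and an update step on a small synthetic dataset (the data, k, and iteration count are placeholders, not taken from the text above):

import numpy as np

def lloyd_kmeans(X, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # pick k initial seeds from the data
    for _ in range(n_iter):
        # assignment step: each point joins the cluster with the nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: each center becomes the mean of the points assigned to it
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

X = np.random.default_rng(1).random((50, 2))
labels, centers = lloyd_kmeans(X, k=3)
print(labels[:10])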
Example of implementing document clustering in Python (the listing below builds a similarity graph over TF-IDF document vectors and clusters it with the graph-based MajorClust algorithm):
import sys
from math import log, sqrt
from itertools import combinations

def choose_cluster(node, cluster_lookup, edges):
    # Reassign a node to the cluster whose members carry the largest
    # total edge weight among the node's neighbours.
    new = cluster_lookup[node]
    if node in edges:
        seen, num_seen = {}, {}
        for target, weight in edges.get(node, []):
            seen[cluster_lookup[target]] = seen.get(
                cluster_lookup[target], 0.0) + weight
        for k, v in seen.items():
            num_seen.setdefault(v, []).append(k)
        new = num_seen[max(num_seen)][0]
    return new

def majorclust(graph):
    # Every document starts in its own cluster; nodes are then moved
    # until no further reassignments occur.
    cluster_lookup = dict((node, i) for i, node in enumerate(graph.nodes))
    ...

def get_distance_graph(documents):
    # Build a graph whose edge weights are pairwise similarities between documents.
    class Graph(object):
        def __init__(self):
            self.edges = {}
        ...

def get_documents():
    texts = [...]  # the raw input texts

def main(args):
    documents = get_documents()
    add_tfidf_to(documents)
    dist_graph = get_distance_graph(documents)

    for cluster in majorclust(dist_graph):
        print("=========")
        for doc_id in cluster:
            print(documents[doc_id]["text"])
2.2. Markov logic network
Briefly, it is a collection of formulas from first order logic, to each of which is assigned a real number, the weight. Taken as a Markov network, the vertices of the network graph are atomic formulas, and the edges are the logical connectives used to construct the formula. Each formula is considered to be a clique, and the Markov blanket is the set of formulas in which a given atom appears. A potential function is associated to each formula, and takes the value of one when the formula is true, and zero when it is false. The potential function is combined with the weight to form the Gibbs measure and partition function for the Markov network.
The above definition glosses over a subtle point: atomic formulas do not have a truth value unless they are grounded and given an interpretation; that is, until they are ground atoms with a Herbrand interpretation. Thus, a Markov logic network becomes a Markov network only with respect to a specific grounding and interpretation; the resulting Markov network is called the ground Markov network. The vertices of the graph of the ground Markov network are the ground atoms. The size of the resulting Markov network thus depends strongly (exponentially) on the number of constants in the domain of discourse.
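Concretely, writing w_i for the weight of the i-th formula and n_i(x) for the number of its groundings that are true in a possible world x, the ground Markov network defines the standard log-linear (Gibbs) distribution:

P(X = x) = \frac{1}{Z} \exp\!\Big( \sum_i w_i \, n_i(x) \Big),
\qquad
Z = \sum_{x'} \exp\!\Big( \sum_i w_i \, n_i(x') \Big)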
3. Experiments
A good clustering of text requires effective feature selection and a proper choice of the algorithm for the task at hand. Among the different classes of algorithms, the distance-based methods are among the most popular in a wide variety of applications.
In recent years, the main trend in research in this area has been in the context of two kinds of text data:
Dynamic Applications: The large amounts of text data being created by dynamic applications such as social networks or online chat applications have created an immense need for streaming text clustering applications. Such streaming applications need to be applicable in the case of text which is not very clean, as is often the case for applications such as social networks.
Heterogeneous Applications: Text applications increasingly arise in heterogeneous settings in which the text is available in the context of links and other heterogeneous multimedia data. For example, in social networks such as Flickr the clustering often needs to be applied in such a scenario. Therefore, it is critical to effectively adapt text-based algorithms to heterogeneous multimedia scenarios.
We note that the field of text clustering is too vast to cover comprehensively in a single section. Some methods, such as committee-based clustering [1], cannot even be neatly incorporated into any class of methods, since they use a combination of the different clustering methods in order to create a final clustering result. The main purpose of this section is to provide a comprehensive overview of the main algorithms which are often used in the area, as a starting point for further study.
3.1. Distance-based Clustering Algorithms
Agglomerative and Hierarchical Clustering Algorithms
The general concept of agglomerative clustering is to successively merge documents into clusters based on their similarity with one another. Almost all hierarchical clustering algorithms successively merge groups based on the best pairwise similarity between these groups of documents. The main differences between these classes of methods lie in how this pairwise similarity is computed between the different groups of documents. For example, the similarity between a pair of groups may be computed as the best-case similarity, average-case similarity, or worst-case similarity between documents which are drawn from these pairs of groups. Conceptually, the process of agglomerating documents into successively higher levels of clusters creates a cluster hierarchy (or dendrogram) in which the leaf nodes correspond to individual documents, and the internal nodes correspond to the merged groups of clusters. When two groups are merged, a new node is created in this tree corresponding to this larger merged group. The two children of this node correspond to the two groups of documents which have been merged to create it.
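As a small sketch (SciPy on a toy three-document corpus; the documents and cut level are assumptions), the linkage method selects which of these pairwise group similarities is used: 'single' corresponds to the best-case, 'complete' to the worst-case, and 'average' to the average-case similarity:

from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cell divides", "battery cell voltage", "prison cell block"]   # toy corpus
X = TfidfVectorizer().fit_transform(docs).toarray()

# Build the dendrogram bottom-up, then cut it into two clusters.
Z = linkage(X, method="average", metric="cosine")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)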
Distance-based Partitioning Algorithms
Partitioning algorithms are widely used in the database literature in order to efficiently create clusters of objects. The two most widely used distance-based partitioning algorithms are as follows:
3.1.1. k-medoid clustering algorithms:
In k-medoid clustering algorithms, we use a set of points from the original data as the anchors (or medoids) around which the clusters are built. The key aim of the algorithm is to determine an optimal set of representative documents from the original corpus around which the clusters are built. Each document is assigned to its closest representative from the collection. This creates a running set of clusters from the corpus which are successively improved by a randomized process.
The algorithm works with an iterative approach in which the set of k representatives is successively improved with the use of randomized interchanges. Specifically, we use the average similarity of each document in the corpus to its closest representative as the objective function which needs to be improved during this interchange process. In each iteration, we replace a randomly picked representative in the current set of medoids with a randomly picked representative from the collection, if it improves the clustering objective function. This approach is applied until convergence is achieved.
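A toy sketch of this random-interchange procedure over a precomputed similarity matrix (the matrix, k, and iteration count are placeholders, not from the text above):

import random
import numpy as np

def kmedoids_random_swap(sim, k, n_iter=200, seed=0):
    rng = random.Random(seed)
    n = sim.shape[0]
    medoids = rng.sample(range(n), k)

    def objective(meds):
        # average similarity of every document to its closest medoid
        return np.mean(sim[:, meds].max(axis=1))

    best = objective(medoids)
    for _ in range(n_iter):
        # swap a randomly chosen medoid for a randomly chosen non-medoid document
        cand = medoids[:]
        cand[rng.randrange(k)] = rng.choice([i for i in range(n) if i not in medoids])
        score = objective(cand)
        if score > best:                 # keep the swap only if it improves the objective
            medoids, best = cand, score
    return medoids

sim = np.random.default_rng(0).random((20, 20))   # placeholder similarity matrix
sim = (sim + sim.T) / 2
print(kmedoids_random_swap(sim, k=3))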
There are two main disadvantages of the use of k-medoids based clustering algorithms, one of which is specific to the case of text data. One general disadvantage of k-medoids clustering algorithms is that they require a large number of iterations in order to achieve convergence and are therefore quite slow. This is because each iteration requires the computation of an objective function whose time requirement is proportional to the size of the underlying corpus.
The second key disadvantage is that k-medoid algorithms do not work very well for sparse data such as text. This is because a large fraction of document pairs do not have many words in common, and the similarities between such document pairs are small (and noisy) values. Therefore, a single document medoid often does not contain all the concepts required in order to effectively build a cluster around it. This characteristic is specific to the case of the information retrieval domain, because of the sparse nature of the underlying text data.
3.1.2. k-means clustering algorithms:
The k-means clustering algorithm also uses a set of k representatives around which the clusters are built. However, these representatives are not necessarily obtained from the original data and are refined somewhat differently than in a k-medoids approach. The simplest form of the k-means approach is to start off with a set of k seeds from the original corpus, and assign documents to these seeds on the basis of closest similarity. In the next iteration, the centroid of the points assigned to each seed replaces that seed. In other words, the new seed is defined so that it is a better central point for this cluster. This approach is continued until convergence. One of the advantages of the k-means method over the k-medoids method is that it requires an extremely small number of iterations in order to converge. Observations seem to suggest that for many large data sets, it is sufficient to use five or fewer iterations for an effective clustering. The main disadvantage of the k-means method is that it is still quite sensitive to the initial set of seeds picked during the clustering. Secondly, the centroid for a given cluster of documents may contain a large number of words. This will slow down the similarity calculations in the next iteration. A number of methods are used to reduce these effects, which will be discussed later on in this section.
Implementing document clustering using scikit-learn and the k-means algorithm:
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

docs = []  # populate with the raw text of the documents to cluster
print(docs[0])

# Build a sparse TF-IDF term-document matrix from the corpus.
vectorizer = TfidfVectorizer(lowercase=False, max_df=0.8)
fs = vectorizer.fit_transform(docs)
fs_train, fs_test = train_test_split(fs, test_size=0.4, random_state=0)

# Fit k-means on the training split and assign clusters to the held-out documents.
kmeans = KMeans(n_clusters=2)
kmeans.fit(fs_train)

pred = kmeans.predict(fs_test)
print(pred)
3.2. Word and Phrase-based Clustering
Since text documents are drawn from an inherently high-dimensional domain, it can be useful to view the problem in a dual way, in which important clusters of words may be found and utilized for finding clusters of documents. In a corpus containing d terms and n documents, one may view a term-document matrix as an n × d matrix, in which the (i,j)th entry is the frequency of the jth term in the ith document. We note that this matrix is extremely sparse since a given document contains an extremely small fraction of the universe of words. We note that the problem of clustering rows in this matrix is that of clustering documents, whereas that of clustering columns in this matrix is that of clustering words. In reality, the two problems are closely related, as good clusters of words may be leveraged in order to find good clusters of documents and vice versa. For example, the work in [2] determines frequent itemsets of words in the document collection, and uses them to determine compact clusters of documents. This is somewhat analogous to the use of clusters of words for determining clusters of documents. The most general technique for simultaneous word and document clustering is referred to as co-clustering. This approach simultaneously clusters the rows and columns of the term-document matrix, in order to create such clusters. This can also be considered to be equivalent to the problem of re-ordering the rows and columns of the term-document matrix so as to create dense rectangular blocks of non-zero entries in this matrix. In some cases, the ordering information among words may be used in order to determine good clusters. The work in [3] determines the frequent phrases in the collection and leverages them in order to determine document clusters.
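As a sketch of co-clustering on a term-document matrix, scikit-learn's SpectralCoclustering clusters rows (documents) and columns (terms) simultaneously; the toy corpus and cluster count below are illustrative assumptions, not taken from [2] or [3]:

from sklearn.cluster import SpectralCoclustering
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["stars and astronomy", "painting and art",
        "telescope observes stars", "art gallery painting"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)            # n documents x d terms

model = SpectralCoclustering(n_clusters=2, random_state=0).fit(X)
print(model.row_labels_)                      # document clusters (rows)
print(model.column_labels_)                   # word clusters (columns)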
It is important to understand that the problems of word clustering and document clustering are essentially dual problems which are closely related to one another. The former is related to dimensionality reduction, whereas the latter is related to traditional clustering. The boundary between the two problems is quite fluid, because good word clusters provide hints for finding good document clusters and vice versa. For example, a more general probabilistic framework which determines word clusters and document clusters simultaneously is referred to as topic modeling [4]. Topic modeling is a more general framework than either clustering or dimensionality reduction.
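As one concrete instance of topic modeling, an LDA sketch (scikit-learn; the toy corpus and topic count are assumptions) yields word clusters (topics) and soft document clusters at the same time:

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["stars telescope galaxy", "painting canvas gallery",
        "galaxy stars orbit", "brush painting art"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)         # per-document distribution over topics
terms = vectorizer.get_feature_names_out()
for t, weights in enumerate(lda.components_):
    top = weights.argsort()[::-1][:3]      # three most probable words of this topic
    print("topic", t, [terms[i] for i in top])
print(doc_topics.round(2))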
3.3. Semi-Supervised Clustering
In some applications, prior knowledge may be available about the kinds of clusters that are present in the underlying data. This prior knowledge may take the form of labels attached to the documents, which indicate their underlying topics. For example, if we wish to use the broad distribution of topics in the Yahoo! taxonomy in order to supervise the clustering process of a new web collection, one way of performing supervision would be to add some labeled pages from the Yahoo! taxonomy to the collection. Typically such pages would contain labels of the form Science Astronomy or Arts Painting, which indicate the subject area of the added pages. Such knowledge can be very useful in creating significantly more coherent clusters, especially when the total number of clusters is large. The process of using such labels to guide the clustering process is referred to as semi-supervised clustering. This form of learning is a bridge between the clustering and classification problems, because it uses the underlying class structure, but it is not completely tied down by the specific structure. As a result, such an approach finds applicability in both the clustering and classification scenarios.
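A sketch of one simple way to realize this supervision is seeded k-means, where centroids of a few labeled pages initialize the clustering; the corpus, labels and parameters below are toy assumptions:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

labeled = {"Science Astronomy": ["stars telescope orbit", "galaxy astronomy stars"],
           "Arts Painting":     ["oil painting canvas", "gallery painting brush"]}
unlabeled = ["telescope observes a distant galaxy", "the artist framed the canvas"]

all_docs = [d for docs in labeled.values() for d in docs] + unlabeled
X = TfidfVectorizer().fit_transform(all_docs).toarray()

# Seed each cluster with the centroid of its labeled pages.
seeds = np.array([X[0:2].mean(axis=0), X[2:4].mean(axis=0)])
km = KMeans(n_clusters=2, init=seeds, n_init=1).fit(X)
print(km.labels_)      # the labeled pages guide where the unlabeled pages fall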
3.4. Regular Expression
In theoretical computer science and formal language theory, a regular expression (abbreviated regex or regexp) is a sequence of characters that forms a search pattern, mainly for use in pattern matching with strings, or string matching, i.e. "find and replace"-like operations. The concept arose in the 1950s, when the American mathematician Stephen Kleene formalized the description of a regular language, and came into common use with the Unix text processing utilities ed, an editor, and grep (global regular expression print), a filter.
Each character in a regular expression is either understood to be a metacharacter with its special meaning, or a regular character with its literal meaning. Together, they can be used to identify textual material of a given pattern, or process a number of instances of it that can vary from a precise equality to a very general similarity of the pattern. The pattern sequence itself is an expression that is a statement in a language designed specifically to represent prescribed targets in the most concise and flexible way to direct the automation of text processing of general text files, specific textual forms, or of random input strings.
A very simple use of a regular expression would be to locate the same word spelled two different ways in a text editor, for example the pattern seriali[sz]e, which matches both "serialise" and "serialize". A wildcard match can also achieve this, but wildcard matches differ from regular expressions in that wildcards are more limited in what they can pattern (having fewer metacharacters and a simpler language base), whereas regular expressions are not. A usual context of wildcard characters is in globbing similar names in a list of files, whereas regular expressions are usually employed in applications that pattern-match text strings in general.
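In Python's re module, that example looks like this (the sample sentence is made up):

import re

text = "You can serialise data here and serialize it there."
print(re.findall(r"seriali[sz]e", text))   # ['serialise', 'serialize']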
A regular expression processor processes a regular expression statement expressed in terms of a grammar in a given formal language, and with that examines the target text string, parsing it to identify substrings that are members of the language described by the regular expression.
Regular expressions are so useful in computing that the various systems to specify regular expressions have evolved to provide both a basic and extended standard for the grammar and syntax; modern regular expressions heavily augment the standard. Regular expression processors are found in several search engines, search and replace dialogs of several word processors and text editors, and in the command lines of text processing utilities, such as sed and AWK.[10] [11]
3.4.1. Syntax
A regexp pattern matches a target string. The pattern is composed of a sequence of atoms. An atom is what matches at a point in the target string. The simplest atom is a literal, but grouping parts of the pattern to match an atom will require using ( ) as metacharacters. Metacharacters help form: atoms; quantifiers telling how many atoms (and whether it is a greedy quantifier or not); a logical OR character, which offers a set of alternatives, and a logical NOT character, which negates an atom's existence; and back references to refer to previous atoms of a completing pattern of atoms. A match is made, not when all the atoms of the string are matched, but rather when all the pattern atoms in the regular expression have matched. The idea is to make a small pattern of characters stand for a large number of possible strings, rather than compiling a large list of all the literal possibilities.
Depending on the regexp processor there are about fourteen metacharacters, characters that may or may not have their literal character meaning, depending on context, or whether they are "escaped", i.e., preceded by an escape sequence, in this case the backslash \. Modern and POSIX extended regular expressions use metacharacters more often than their literal meaning, so to avoid "backslash-osis" it makes sense to have a metacharacter escape to a literal mode; but starting out, it makes more sense to have the four bracketing metacharacters ( ) and { } be primarily literal, and "escape" that usual meaning to become metacharacters. Common standards implement both. The usual metacharacters are {}[]()^$.|*+? and \. The usual characters that become metacharacters when escaped are dswDSW and N.
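A few toy illustrations of these ideas in Python's re module (alternation, a quantifier, escaped character classes, and a back reference; the sample strings are made up):

import re

# a quantifier: 'u?' makes the 'u' optional, so both spellings match
print(re.search(r"colou?r", "colour chart").group())              # 'colour'

# escaped classes: \d matches digits, \s whitespace, \w word characters
print(re.findall(r"\d+", "42 cats, 7 dogs"))                      # ['42', '7']

# a back reference: \1 repeats whatever group 1 matched (here, a doubled word)
print(re.search(r"\b(\w+)\s+\1\b", "it is is a test").group(1))   # 'is'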
4. Future Works
4.1. Optimizing Regular Expression Clustering for Massive Pattern Search
Optimizing the search for large sets of patterns, due to their prevalence, is becoming critical in a variety of applications including financial services, healthcare monitoring, RFID-based inventory management, publish/subscribe and network intrusion detection systems (NIDS). For example, in NIDS, to identify particular packets in a packet stream, modern network devices need to perform deep packet inspection at high rates given a limited memory and a large number of signatures. It is often proposed to group together regular expressions (REs) denoting similar patterns into a DFA or an NFA to decrease the memory usage and increase the average throughput. In this paper, we propose a framework comprised of three novel distance metrics used to cluster REs and an execution technique that uses the information aggregated by the distance metrics to optimize the number of memory reads during execution, thereby increasing the overall throughput. Our distance metrics are unique in that each metric is specialized for a particular type of pattern. The first two metrics optimize for compactness of REs and specialize in regular expressions that have long, contiguous similarity segments. Our third metric optimizes for redundancy and specializes in REs that have random segments of similarity. We show how compression, prefetching and memory access optimizations are made possible when the notion of redundancy is leveraged. We validate our approach by implementing it in a modern NIDS, Snort, which matches potentially thousands of patterns against an incoming network stream. We observe a five-fold speed improvement over the unmodified version of Snort.
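As a very rough sketch of the general idea only (a toy pairwise similarity and a greedy grouping, not the distance metrics proposed above), similar patterns can be clustered and each cluster compiled into a single alternation so that a whole group is matched in one pass:

import re
from difflib import SequenceMatcher

patterns = [r"GET /admin", r"GET /admin/login", r"user=root", r"user=admin"]

def similarity(a, b):
    # crude textual similarity between two pattern strings
    return SequenceMatcher(None, a, b).ratio()

# greedy clustering: put a pattern into the first cluster whose seed is similar enough
clusters = []
for p in patterns:
    for c in clusters:
        if similarity(p, c[0]) > 0.5:
            c.append(p)
            break
    else:
        clusters.append([p])

# compile each cluster into one alternation so its patterns are matched together
compiled = [re.compile("|".join(f"(?:{p})" for p in c)) for c in clusters]
print([c.pattern for c in compiled])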
Effects of Number of Clusters on Memory
In this section, we observe how the number of clusters affects memory usage when using only the redundancy or only the similarity distance metric. When using redundancy, the overall memory usage is much lower than when using similarity. We can also observe that memory does not grow linearly with the number of clusters, which is due to the fact that as the number of clusters varies, the cluster compression does as well. It is seldom the case that more clusters in fact require less memory. This is mainly because with fewer clusters, dissimilar patterns might have been merged, requiring relatively large amounts of memory. Note that the similarity metric produces more sporadic results as the number of clusters is varied, as opposed to the redundancy metric, which is less sensitive to the number of clusters.
Observe that obtaining the optimal number of clusters is a difficult problem. In our approach we merge patterns until we run out of pre-allocated space. We can, however, utilize the local change in memory and in the number of states to derive an approximation of when to terminate our merging algorithm. This derivation of the approximate number of pattern clusters is, however, part of our future work.[12]
Effects of Clustering on Throughput
We fix the amount of total memory the clusters are allowed to consume, and using the redundancy distance metric we observe how the number of clusters affects throughput. As can be seen from Table III, there is an inverse relationship between the throughput and the number of groups. With the initial memory size set to 1024KB we were able to generate 10 groups (from the original 100 patterns) without increasing memory usage. Fewer groups are created with higher memory limits. Throughput was measured relative to the initial setting. The initial throughput was 29315 bps and with the memory limit set to 2048KB the throughput increased to 58064 bps (approximately 1.981 times higher than the initial value). This throughput is much lower than that presented in Figure 5 due to a standalone implementation in Java as opposed to a native implementation of Snort.[13]
Table III. Effect of the number of pattern groups on relative throughput (redundancy metric).

Throughput (relative) | States | Groups | Shared Memory Limit
1.000                 |    442 |     10 | 1024KB
1.981                 |   1728 |      5 | 2048KB
3.073                 |   5615 |      3 | 4092KB
8.422                 |  23746 |      1 | 8184KB
6. References
[1] P. Pantel, D. Lin. Document Clustering with Committees, ACM SIGIR Conference, 2002.
[2] F. Beil, M. Ester, X. Xu. Frequent term-based text clustering, ACM KDD Conference, 2002.
[3] O. Zamir, O. Etzioni. Web Document Clustering: A Feasibility Demonstration, ACM SIGIR Conference, 1998.
[4] T. Hofmann. Probabilistic Latent Semantic Indexing. ACM SIGIR Conference, 1999.
[5] C. C. Aggarwal, Y. Zhao, P. S. Yu. On Text Clustering with Side Information, ICDE Conference, 2012.
[6] C. C. Aggarwal, P. S. Yu. On Effective Conceptual Indexing and Similarity Search in Text, ICDM Conference, 2001.
[7] C. C. Aggarwal, C. Procopiuc, J. Wolf, P. S. Yu, J.-S. Park. Fast Algorithms for Projected Clustering, ACM SIGMOD Conference, 1999.
[8] R. Agrawal, J. Gehrke, P. Raghavan. D. Gunopulos. Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications, ACM SIGMOD Conference, 1999.
[9] M. Franz, T. Ward, J. McCarley, W.-J. Zhu. Unsupervised and supervised clustering for topic tracking. ACM SIGIR Conference, 2001.
[10] Kernighan, Brian. "A Regular Expression Matcher". Beautiful Code. O'Reilly Media. pp. 1–2. ISBN 978-0-596-51004-6. Retrieved 2013-05-15.
[11] Friedl, Jeffrey. "The Mechanics of Expression Processing". Mastering Regular Expressions. O'Reilly Media. p. 182. ISBN 0-596-52812-4.
[12] S. Kumar, B. Chandrasekaran, J. S. Turner, and G. Varghese, “Curing regular expressions matching algorithms from insomnia, amnesia, and acalculia,” in ANCS, 2007, pp. 155–164.
[13] F. Yu, Z. Chen, Y. Diao, T. V. Lakshman, and R. H. Katz, “Fast and memory-efficient regular expression matching for deep packet inspection,” in ANCS, 2006, pp. 93–102.
7. Useful Links
POSIX: an acronym for "Portable Operating System Interface", is a family of standards specified by the IEEE for maintaining compatibility between operating systems. POSIX defines the application programming interface (API), along with command-line shells and utility interfaces, for software compatibility with variants of Unix and other operating systems.