Abstract:
Recognizing and extracting exact named entities, such as persons, locations, and organizations, from natural language text is called Named Entity Recognition (NER). For example, in the sentence "[PER Murdoch] discusses future of [ORG News Corp]", understanding that the word "Murdoch" refers to a person and that the entity "News Corp" is the name of an organization is the task of an NER agent.
In this paper we first explain different techniques and methods and discuss the design challenges, misconceptions, and feature extraction that underlie the development of an efficient and robust NER system. We then introduce Iris, a Farsi named entity recognizer agent based on statistical distribution techniques and machine learning methods. Iris uses techniques from natural language processing together with the syntax and structure of Farsi words and sentences.

1. Introduction

Natural Language Processing applications are characterized by making complex interdependent decisions that require large amounts of prior knowledge. In this paper we discuss one such application, Named Entity Recognition (NER). The sentence below shows the necessity of using prior knowledge and non-local decisions: in the absence of such methods it is difficult to recognize that "Blinker" is a person.

[PER BLINKER] BAN LIFTED . [LOC LONDON] 1996-12-06 [MISC Dutch] forward [PER Reggie Blinker] had his indefinite suspension lifted by [ORG FIFA] on Friday and was set to make his [ORG Sheffield Wednesday] comeback against [ORG Liverpool] on Saturday. [PER Blinker] missed his club’s last two games after [ORG FIFA] slapped a worldwide ban on him for appearing to sign contracts for both [ORG Wednesday] and [ORG Udinese] while he was playing for [ORG Feyenoord].

It is also not obvious that the last mention of "Wednesday" is an organization. The task of an NER system is to identify, extract, and label the entity names in a text.
Data is being generated at a fast pace, and with it grows the need to use it; but most of the data created is unstructured and unlabeled. The basic task of labeling this data, i.e. giving meaning to the text in a way that computers and other algorithms can understand, is therefore very important. It is the basis of many other problems such as question answering, summarization systems, information retrieval, machine translation, semantic web search, and bioinformatics.
Researchers currently use different methods for this task, which we will discuss. NER involves two subtasks: first, the identification of proper names in text, and second, the classification of these names into a set of predefined categories of interest.

1.1. Named entity types

Different tasks require different numbers of types, but the most common types are PER for person names, ORG for organizations (companies, government organizations, committees, etc.), and LOC for locations (cities, countries, etc.).
The CoNLL1 shared task concentrated on four named entity types: persons (PER), locations (LOC), organizations (ORG), and names of miscellaneous entities (MISC) that do not belong to the previous three groups.
MUC2, in their MUC-7 dataset, which is a subset of the North American News Text Corpora, annotated entities including people, locations, organizations, monetary units, and so on.

1.2. NER tagging schemes

Before we go on: in the task of NER there are two main tagging schemes, BIO and BILOU.

  • BIO: 'B' stands for 'beginning' and signifies the beginning of a named entity, 'I' stands for 'inside' and signifies that the word is inside a named entity, and 'O' stands for 'outside' and signifies that the word is a regular word outside any named entity[1].
    Table 1-1. An example of named entity tags in the BIO scheme.

NE tag   word
B-PER    John
I-PER    Kerry
O        will
O        fly
O        to
B-LOC    Paris
  • BILOU: this stands for Beginning, Inside, and Last tokens of multi-token chunks, Unit-length chunks, and Outside, which is self-explanatory. In the evaluation section we will discuss the evaluation process for each scheme and the problems that exist[1]. A small sketch converting between the two schemes follows the table below.
    Table 1-2. An example of named entity tags in the BILOU scheme.

NE tag   word
B-PER    John
L-PER    Kerry
O        will
O        fly
O        to
U-LOC    Paris
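
To make the two schemes concrete, here is a minimal Python sketch (the function name is ours, for illustration) that converts the BIO tags of the example sentence above into BILOU tags:

    def bio_to_bilou(tags):
        """Convert a BIO tag sequence to BILOU: a B-X not followed by I-X
        becomes U-X (unit-length chunk); the last I-X of a chunk becomes L-X."""
        bilou = []
        for i, tag in enumerate(tags):
            next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
            if tag == "O":
                bilou.append("O")
            elif tag.startswith("B-"):
                # unit-length chunk unless the next token continues it
                bilou.append(tag if next_tag == "I-" + tag[2:] else "U-" + tag[2:])
            elif tag.startswith("I-"):
                # last token of the chunk unless the next token continues it
                bilou.append(tag if next_tag == tag else "L-" + tag[2:])
        return bilou

    print(bio_to_bilou(["B-PER", "I-PER", "O", "O", "O", "B-LOC"]))
    # ['B-PER', 'L-PER', 'O', 'O', 'O', 'U-LOC']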

1.3. Different NER approaches

In references [2,3] different approaches to NER are introduced, each with different uses:

1.3.1. Rule-based NER

The handcrafted rule-based approach focuses on extracting named entities using large sets of human-made rules. Generally the tagger system consists of a set of patterns using grammatical (e.g. part of speech), syntactic (e.g. word precedence), and orthographic features (e.g. capitalization, prefixes, etc.) in combination with dictionaries. These kinds of models have better results for restricted domains and can detect complex entities that learning models have difficulty with. However, rule-based systems lack portability; even when the data changes only slightly, changing the rules carries a high cost. These approaches are domain- and language-specific and do not easily adapt to new domains and languages[2].
FASTUS is a system based on handcrafted regular expressions which divides the task into three main steps: recognizing phrases, recognizing patterns and groups, and, in the third stage, merging incidents, in which distinct structures that describe the same entity are identified and merged[4].

1.3.2. Machine learning-based NER

In these systems, which are the main focus of this article, the NER task is converted into a classification problem and statistical models are employed to solve it, as will be discussed later on. In this type of approach the system looks for patterns and relationships in the text in order to build a model using statistical models and machine learning algorithms. Unlike the rule-based method, these approaches can easily be adapted to different domains and languages[2,3].
Two main methods of machine learning are used in this task[2]:

  • Supervised learning: this involves a system that learns to classify a given set of labeled examples; the term 'supervised' is used because the people who marked the training examples are teaching the program the right distinctions. This approach requires a large amount of labeled training data. Many methods of this type, such as the HMM3[5], CRF4[6], and perceptron[7], have been proposed for NER and will be discussed later.

  • Semi-supervised learning: a recent approach is to use both labeled and unlabeled data; it benefits from both supervised and unsupervised learning and has some interesting techniques. Labeled data is hard to find and expensive and time-consuming to produce; in contrast, unlabeled data is cheap and easy to obtain in huge amounts. Recent results, especially in this task, have shown that one can reach better results simply by adding unlabeled data to a supervised system. Some commonly used methods for semi-supervised learning are co-training, graph-based methods, and so on[2].

1.3.3. Hybrid NER

The approach here is to combine the rule-based and machine learning-based methods and build new methods from the strongest points of each. Although this type of method may get better results than the other approaches, the weakness of handcrafted rule-based NER remains the same[2].

2. Related works

In recent years, automatic named entity recognition and extraction systems have become a popular research area, and a considerable number of studies have addressed the development of these systems. NE5 recognition, like many NLP6 tasks, can be broken down into two stages: preprocessing and processing.
The focus of this paper is mostly on machine learning methods, so the features and methods discussed are mostly in that context.

2.1. Preprocessing

In an NLP problem we are dealing with a natural language in which some things are fixed and hardly change, so we can use them as features to solve our problem. As illustrated in the introduction, NER is a knowledge-intensive task, so in this section some important knowledge resources will be discussed.

2.1.1. Gazetteers

An important question at the inception of NER was whether machine learning techniques are necessary at all, or whether simple dictionary lookup would be sufficient for good performance. Indeed, the baseline for this task was just a dictionary lookup for entities that appear in the training data[8]. It turns out that while problems of coverage and ambiguity prevent a simple straightforward lookup, using gazetteers as a feature in machine learning-based approaches is critical for good performance[3].
Knowing this, one can extract gazetteers from the web and from large collections of unlabeled text.
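
As an illustration of gazetteer lookup as a feature, here is a minimal sketch; the tiny gazetteers and feature names are hypothetical stand-ins for lists extracted from the web or unlabeled text:

    # Hypothetical gazetteers; real ones would hold thousands of entries.
    PERSON_GAZETTEER = {"john kerry", "reggie blinker"}
    LOCATION_GAZETTEER = {"london", "paris"}

    def gazetteer_features(tokens, i):
        """Binary gazetteer-membership features for the token at position i,
        checking both the unigram and the bigram starting at i."""
        unigram = tokens[i].lower()
        bigram = " ".join(t.lower() for t in tokens[i:i + 2])
        return {
            "in-person-gazetteer": unigram in PERSON_GAZETTEER or bigram in PERSON_GAZETTEER,
            "in-location-gazetteer": unigram in LOCATION_GAZETTEER or bigram in LOCATION_GAZETTEER,
        }

    print(gazetteer_features(["John", "Kerry", "will", "fly", "to", "Paris"], 0))
    # {'in-person-gazetteer': True, 'in-location-gazetteer': False}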

2.1.2. Brown clustering

It has been shown that unlabeled text can be used to improve the performance of an NER system[3]. Brown clustering is a hierarchical word clustering algorithm[9]. The intuition behind it is that similar words appear in similar contexts; more precisely, similar words have similar distributions of words to their immediate left and right. For example, before the word "the" one is likely to see "in", "on", and many similar words. The algorithm takes sentences as input and, based on the words and their similarities in context, outputs each word and its cluster. For example, here is an input and output of the algorithm[10]; a sketch of how the resulting bit strings are used as features follows the table below.
input:

the cat chased the mouse
the dog chased the cat
the mouse chased the dog

Table 2-1. Output of the Brown clustering algorithm

cluster (bit string)   word
0                      the
10                     chased
110                    dog
1110                   mouse
1111                   cat
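
Because Brown clusters form a binary tree, prefixes of a word's bit string give clusterings at different granularities, and a common trick is to use several prefix lengths as features. A minimal sketch, assuming a precomputed cluster table like Table 2-1 (the feature names are ours):

    # Hypothetical cluster table, as produced by a Brown clustering run.
    BROWN_CLUSTERS = {"the": "0", "chased": "10", "dog": "110",
                      "mouse": "1110", "cat": "1111"}

    def brown_features(word, prefix_lengths=(2, 4)):
        """Bit-string prefix features at several granularities."""
        bits = BROWN_CLUSTERS.get(word.lower())
        if bits is None:
            return {}
        return {f"brown-prefix-{n}": bits[:n] for n in prefix_lengths}

    print(brown_features("mouse"))  # {'brown-prefix-2': '11', 'brown-prefix-4': '1110'}
    print(brown_features("cat"))    # {'brown-prefix-2': '11', 'brown-prefix-4': '1111'}

Note how "mouse" and "cat" share the coarse prefix '11', letting the model generalize across them.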

2.2. Processing

In this section we study some local and non-local features.

2.2.1. Lexical information

Using n-grams one can get a good improvement in results. An n-gram is a language model for predicting the next item in a sequence of items, in this case the characters of the words. An n-gram of size one is called a unigram and of size two a bigram, which are the common n-grams used in this task. For example, the 3-grams that can be generated from the character sequence "good morning" are "goo", "ood", "od ", "d m", " mo", "mor" and so forth[2].
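
A minimal sketch of extracting character n-grams, reproducing the 3-gram example above:

    def char_ngrams(text, n):
        """All character n-grams of the given text, in order."""
        return [text[i:i + n] for i in range(len(text) - n + 1)]

    print(char_ngrams("good morning", 3))
    # ['goo', 'ood', 'od ', 'd m', ' mo', 'mor', 'orn', 'rni', 'nin', 'ing']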

2.2.2. Affix

An affix is an addition to the base form of a word in order to change its meaning or create a new word, e.g. suffix and prefix letters.
Using this feature one can once again identify similar words[2].
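
A minimal sketch of affix features, taking the first and last few characters of a word as prefix and suffix features (the feature names and length cutoff are our choices):

    def affix_features(word, max_len=3):
        """Prefix and suffix substrings of the word, up to max_len characters."""
        feats = {}
        for n in range(1, min(max_len, len(word)) + 1):
            feats[f"prefix-{n}"] = word[:n]
            feats[f"suffix-{n}"] = word[-n:]
        return feats

    print(affix_features("playing"))
    # {'prefix-1': 'p', 'suffix-1': 'g', 'prefix-2': 'pl', 'suffix-2': 'ng',
    #  'prefix-3': 'pla', 'suffix-3': 'ing'}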

2.2.3. POS7 tags

Part of speech tags are assigned to a word based on its role in the sentence as well as its definition and context, i.e. its relationship with adjacent and related words in the sentence or phrase. A tagging sketch follows the table.
Table 2-2. An example of POS tagging[2,3].

POS tag   word
DET       the
N_SING    waiter
VBD       cleared
N_PL      plates
P         from
DET       the
N_SING    table
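
In practice these tags come from an off-the-shelf tagger. A minimal sketch using NLTK's default English tagger, which emits Penn Treebank tags (DT, NN, VBD, ...) rather than the labels shown in the table:

    import nltk

    # One-time downloads of the tokenizer and tagger models.
    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    tokens = nltk.word_tokenize("The waiter cleared plates from the table")
    print(nltk.pos_tag(tokens))
    # [('The', 'DT'), ('waiter', 'NN'), ('cleared', 'VBD'), ('plates', 'NNS'),
    #  ('from', 'IN'), ('the', 'DT'), ('table', 'NN')]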

2.2.4. Previous NE information

Previous NE information, also called UniChunk, can be used as a feature in many cases. The intuition behind it is to find a similar situation that was recognized before in the training set and compare it to the current situation of word chunks; knowing a number of previous predictions, e.g. the last two named entity predictions, can help find the similar situation[2,11].
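
A minimal sketch of this feature: while decoding left to right, the tags already predicted for earlier positions are available, so the last one or two can be exposed as features (the feature names and 'BOS' padding token are our choices):

    def previous_ne_features(predicted_tags, i):
        """The last two NE predictions before position i, as features;
        'BOS' pads positions before the sentence start."""
        prev1 = predicted_tags[i - 1] if i >= 1 else "BOS"
        prev2 = predicted_tags[i - 2] if i >= 2 else "BOS"
        return {"prev-tag": prev1, "prev-tag-bigram": prev2 + "+" + prev1}

    # While decoding left to right, tags up to position i are already predicted:
    print(previous_ne_features(["B-PER", "I-PER"], 2))
    # {'prev-tag': 'I-PER', 'prev-tag-bigram': 'B-PER+I-PER'}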

2.2.5. Some simple patterns

  • Capitalization of the word (in languages that have capitalization)[3].

  • Number normalization: normalizing dates and numbers, so that 12/8/2004 becomes Date and 1980 becomes DDDD, allows a degree of abstraction over years, phone numbers, etc.[3].

  • Word shapes: map words to a simplified representation that encodes attributes such as length, capitalization, numerals, Greek letters, internal punctuation, etc.[11]. A sketch of such a mapping follows the table below.
    Table 2-3. An example of word shapes[11].

shape    word
Xx-xxx   Varicella-zoster
xXXX     mRNA
XXXd     CPA1
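
A minimal sketch of the basic character-class mapping; note that the first row of the table additionally abbreviates long runs of the same symbol, a variant not reproduced here:

    def word_shape(word):
        """Map each character to its class: uppercase -> 'X', lowercase -> 'x',
        digit -> 'd'; anything else (hyphens, Greek letters, ...) is kept as-is."""
        return "".join(
            "X" if c.isupper() else "x" if c.islower() else "d" if c.isdigit() else c
            for c in word
        )

    print(word_shape("mRNA"))  # xXXX
    print(word_shape("CPA1"))  # XXXd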

2.2.6. Classifier

In this task the most common classifiers are sequence classifiers, in this case ME8 sequence models and CRFs, which predict labels from data[12].
In an ME model there are words, and the goal is to assign a class or tag to each of them. It is common to want to explore the search space more thoroughly: at a certain position one NE tag may look best, but later on, in another state, there may be reason to believe the earlier tags should have been chosen differently. In this section we study two of these inference methods; a decoding sketch follows the list below.

  • Beam inference: in beam inference, at each position, rather than just deciding on the single most likely label, we keep several possibilities: we might keep the top K complete subsequences up to the point we have reached so far, and at each state we consider each partial subsequence and extend it one position further. It is very fast and easy to implement, and it turns out that beam sizes of 3-5 give enough possibilities and work as well as exact inference in many cases. But there is no guarantee that the globally best named entity tag sequence has been found; even worse, the best sequence might never be checked at all, or, in beam inference terms, might fall off the beam[12].

    Figure 2-1. Sequence inference based on a sequence model.

  • Viterbi inference: this is a form of dynamic programming and requires a small window of state influence (e.g. only the past two states are relevant). It guarantees that the globally best sequence is returned, but it is harder to implement and forces one into a fixed small window[12].

  • CRFs: conditional random fields are another kind of probabilistic sequence model. CRFs are undirected graphical models used to calculate the conditional probability of values on particular output nodes given values assigned to other input nodes[13]. Suppose the observed input data sequence

    o = \left\{ o_{1}, o_{2}, \ldots, o_{T} \right\}

    is, for instance, a sequence of words in a document, and let S be a set of FSM states, each associated with a tag such as ORG, with the sequence of states (the values on the T output nodes)

    s = \left\{ s_{1}, s_{2}, \ldots, s_{T} \right\}.

    A CRF defines the conditional probability of a state sequence given an input sequence as

    P(s|o) = \frac{1}{Z_{o}} \exp\left(\sum_{t=1}^{T}\sum_{k}\lambda_{k} f_{k}(s_{t-1}, s_{t}, o, t)\right)

    where Z_{o} is a normalization factor over all state sequences, f_{k} is an arbitrary feature function over its arguments, and \lambda_{k} is a learned weight for each feature function[13]. We will discuss CRFs in more detail in the implementation section.

3. Evaluation

The evaluation of the named entity task is a little tricky: firstly because there are different numbers of named entity tag sets and also different schemes (e.g. the BIO and BILOU schemes discussed earlier), and secondly because it is done per entity and not per token, since in this task the goal is to find both the boundaries and the class of a named entity; in other words, our unit of interest is the whole entity.

Figure 3-1. Example of NE evaluation per entity, showing the unit of interest.

So in the contingency table the values will be at the level of entities.

3.1. Precision and Recall

Precision and recall are standard tools in IR9 evaluation. In this task, TP10 are the entities the system labeled correctly, FP11 are the entities the system labeled incorrectly, and FN12 are the entities that should have been labeled but were missed by the system. With that in mind, precision is:

Precision = \frac{TP}{TP+FP}

and recall is:

Recall = \frac{TP}{TP+FN}

Precision and recall are straightforward for tasks like IR and text categorization, where there is only one grain size, the document. These measures behave a bit differently for NER when there are boundary errors, which are very common. For example, in the sentence "First Bank of Chicago announced earnings ...", if the system labels "Bank of Chicago" as an organization instead of "First Bank of Chicago", that counts as both a FP and a FN, so selecting nothing would have scored better.

3.2. F-score

The F-score is a weighted harmonic mean of precision and recall:

F_{\beta} = (1+\beta^{2}) \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{(\beta^{2} \cdot \mathrm{precision}) + \mathrm{recall}}

Despite the challenges in these measurements, they are the most common evaluation measures for NER. A minimal entity-level scoring sketch follows.
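
A minimal sketch of entity-level scoring: each entity is a (start, end, label) span and only exact matches count, so the boundary error described above yields both a false positive and a false negative:

    def entity_prf(gold, predicted, beta=1.0):
        """Entity-level precision, recall, and F-score over
        (start, end, label) spans; only exact matches count as TP."""
        gold, predicted = set(gold), set(predicted)
        tp = len(gold & predicted)
        fp = len(predicted - gold)
        fn = len(gold - predicted)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        b2 = beta ** 2
        f = ((1 + b2) * precision * recall / (b2 * precision + recall)
             if precision + recall else 0.0)
        return precision, recall, f

    gold = [(0, 4, "ORG")]         # "First Bank of Chicago"
    pred = [(1, 4, "ORG")]         # "Bank of Chicago": a boundary error
    print(entity_prf(gold, pred))  # (0.0, 0.0, 0.0)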

4. Tests

5. Future work

6. References

[1] What do the BILOU tags mean in Named Entity Recognition? http://stackoverflow.com/questions/17116446/what-do-the-bilou-tags-mean-in-named-entity-recognition. As available in November 2015.
[2] Alireza Mansouri, Lilly Suriani Affendey, and Ali Mamat. Named Entity Recognition Approaches. International Journal of Computer Science and Network Security, vol. 8, 2008, pages 339-344.
[3] Lev Ratinov and Dan Roth. Design Challenges and Misconceptions in Named Entity Recognition. Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL), June 2009, pages 147-155.
[4] Jerry R. Hobbs, Douglas Appelt, John Bear, David Israel, Megumi Kameyama, Mark Stickel, and Mabry Tyson. FASTUS: A Cascaded Finite-State Transducer for Extracting Information from Natural-Language Text. SRI International.
[5] GuoDong Zhou and Jian Su. Named Entity Recognition Using an HMM-Based Chunk Tagger. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL 2002), July 2002, pages 473-480.
[6] Vijay Krishnan and Christopher D. Manning. An Effective Two-Stage Model for Exploiting Non-Local Dependencies in Named Entity Recognition. Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the ACL (COLING-ACL 2006), July 2006.
[7] Lars Buitinck and Maarten Marx. Two-Stage Named-Entity Recognition Using Averaged Perceptrons. 2012.
[8] Erik F. Tjong Kim Sang and Fien De Meulder. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. 2003.
[9] Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. Class-Based n-gram Models of Natural Language. Computational Linguistics, vol. 18, 1992.
[10] Percy Liang. Semi-Supervised Learning for Natural Language. Master's thesis, Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2005, pages 1-86.
[11] Dan Jurafsky and Christopher Manning. Sequence Models for Named Entity Recognition. Coursera Stanford NLP course, video 9.3. As viewed in November 2015.
[12] Dan Jurafsky and Christopher Manning. Maximum Entropy Sequence Models. Coursera Stanford NLP course, video 9.4. As viewed in November 2015.
[13] Andrew McCallum and Wei Li. Early Results for Named Entity Recognition with Conditional Random Fields, Feature Induction and Web-Enhanced Lexicons. Conference on Computational Natural Language Learning (CoNLL), 2003, pages 1-4.


  1. the Conference on Natural Language Learning

  2. Message Understanding Conference

  3. hidden Markov model

  4. conditional random field

  5. named entity

  6. natural language processing

  7. part of speech

  8. maximum entropy

  9. information retrieval

  10. true positives

  11. false positives

  12. false negatives

محسن ایمانی

You put a great deal of effort into this phase and the output has turned out very well, which deserves congratulations. We eagerly await your implementation and results in the coming phases.
A few points are worth mentioning:

  • In this phase you said nothing about named entities in the Persian language. Given that there is no major difference in the general methods between Persian and English, this is forgivable for this phase, but in future phases you must take into account, in both implementation and evaluation, that your goal is to work on the Persian language.

  • You have chosen good references, but you did not consult many up-to-date papers or newer methods; it might have been worthwhile to take a brief look at a few recent papers and the progress being made in this branch of natural language processing.

  • It would have been better to give a brief outline of your roadmap for the coming phases in the future work section.

  • Given that this project has also been carried out and evaluated in previous semesters, what we expect from your implementation must go beyond the efforts of previous semesters.