Introduction to the Extended Named Entity Hierarchy
The Extended Named Entity Hierarchy was designed and developed to meet the increasing need for a wider range of NE types. It originates from the first Named Entity set defined by MUC (Grishman et al., 1996), the Named Entity set developed by IREX (Sekine et al., 2000), and the Extended Named Entity hierarchy, which contains approximately 150 NE types (Sekine et al., 2002).
Its applications include Question Answering (Q&A) systems that analyze general texts such as newspaper articles, as well as Information Extraction (IE), Machine Translation (MT), Summarization, and Information Retrieval (IR) systems that serve a variety of NLP applications. For example, a Q&A system provides information that a user wants to know or to extract from articles. Such information can be categorized into a fixed number of hierarchically organized classes. We designed the Extended Named Entity Hierarchy for Q&A and IE systems on the assumption that the information a user wants to know basically takes the form of a noun phrase containing a specific name or numerical value. In other words, the target is not a word that expresses a general concept or class, but rather the name of a concept or thing that can be physically pointed to.
The Extended Named Entity Hierarchy is divided into three major classes: name, time, and numerical expressions (the same three classes as in the NE hierarchies defined by the MUC and IREX projects). Based on our observations, a question about a specific matter often fits into one of these categories. With these three classes at the top, the hierarchy was designed for use in Q&A and IE systems, taking into account the concepts and words that are generally considered common knowledge in ordinary newspaper articles and encyclopedias.
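As a rough illustration of this structure, the hierarchy could be represented as a small tree whose top level holds the three major classes. This is only a sketch: the class name `NEType`, the helper `build_top_level`, and the lowercase labels are assumptions made for the example, and the subtype `person` in the usage note is a hypothetical placeholder rather than a type taken from the text above.

```python
from typing import Dict, List, Optional


class NEType:
    """One node in a named-entity type hierarchy (illustrative only)."""

    def __init__(self, label: str, parent: Optional["NEType"] = None) -> None:
        self.label = label
        self.parent = parent
        self.children: Dict[str, "NEType"] = {}

    def add(self, label: str) -> "NEType":
        """Attach a subtype under this node and return the new node."""
        child = NEType(label, parent=self)
        self.children[label] = child
        return child

    def path(self) -> List[str]:
        """Return the labels from the top-level class down to this node."""
        labels: List[str] = []
        node: Optional[NEType] = self
        while node is not None:
            labels.append(node.label)
            node = node.parent
        return labels[::-1]


def build_top_level() -> Dict[str, NEType]:
    """Create the three major classes named in the text."""
    return {label: NEType(label) for label in ("name", "time", "numerical")}
```

For instance, `build_top_level()["name"].add("person").path()` walks back up from the hypothetical subtype to its top-level class, which is how an answer type could be mapped to one of the three major classes.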
Designing the Extended Named Entity Hierarchy
We defined the classes according to the criterion that each frequently occurring word or noun phrase should be assigned to a class based on its meaning and usage. In practice, the following methods were used to develop the hierarchy.
- We extracted candidate NE expressions from English newspaper articles. Specifically, capitalized words (proper names) and numerical expressions (quantities of things) were extracted from thousands of contexts, and appropriate concept class labels were assigned to them. We then merged and split the classes while checking the overall picture, which produced the final form of the Extended Named Entity Hierarchy.
- Existing thesauri and ontologies (WordNet homepage) were consulted to find information that matches the Extended Named Entity Hierarchy. Related research (Sasaki, 1998; ISI homepage; ACE EDT-document) was also consulted for its NE hierarchies.
- The Extended Named Entity types were revised by actually tagging newspaper articles.
- We created about 3,000 questions and tested them in the Q&A system. The expected response types were then classified according to the Extended NE types, and new classes were added where necessary.
- The index words (approximately 110,000) of an encyclopedia were classified according to the Extended NE types, and new classes were added accordingly.
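The candidate extraction described in the first item above might be sketched as follows. The text does not say how the extraction was actually performed, so the regular expressions, the function names `extract_candidates` and `count_candidates`, and the labels `name` and `numerical` are all assumptions made for illustration. The patterns are deliberately crude and over-generate (sentence-initial words count as capitalized, and the word after a number is kept whether or not it is a unit), which fits a workflow where a person assigns class labels afterwards.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Runs of capitalized tokens, e.g. "New York" (assumed heuristic).
CAPITALIZED = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b")
# A number plus an optional following lowercase word, e.g. "3,000 questions"
# (assumed heuristic; the following word is not guaranteed to be a unit).
NUMERIC = re.compile(r"\b\d+(?:,\d{3})*(?:\.\d+)?(?:\s+[a-z]+)?")


def extract_candidates(text: str) -> List[Tuple[str, str]]:
    """Return (kind, expression) pairs: all names first, then all numbers."""
    found = [("name", m.group()) for m in CAPITALIZED.finditer(text)]
    found += [("numerical", m.group()) for m in NUMERIC.finditer(text)]
    return found


def count_candidates(texts: Iterable[str]) -> Counter:
    """Count candidates across texts so frequent ones can be labeled first."""
    counts: Counter = Counter()
    for text in texts:
        counts.update(extract_candidates(text))
    return counts
```

Calling `count_candidates(articles).most_common(50)` on a list of article strings would give the most frequent candidates, which could then be reviewed by hand and assigned concept class labels.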