Linguistic resources for entity extraction

August 04, 2013 by Paula Petcu


In the world of search engines, a process called entity extraction can rescue a search solution that indexes unstructured content with poor or no metadata. Here are some good linguistic resources that can be used to quickly boost the search experience and the discoverability of the content.

What is entity extraction?

Entity extraction, also known as named entity recognition (NER), is the process of identifying elements in text that belong to predefined categories such as people's names, locations (cities, countries), organizations, quantities, percentages and more [1]. Through this process, you can transform unstructured text into an annotated block of text and open up new ways of interpreting the content.

This can have a positive impact on the search experience. For example, users can navigate the content by filtering on the extracted entities (such as companies, location names, specialized terms, product terminology), and the search solution can query on the entities or boost results based on their values, find similar search results, enrich documents with metadata, monitor trends, and more.
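
To make the filtering idea concrete, here is a minimal sketch of how extracted entities could drive a filter and its counts. The documents, the "locations" field and the function name are all made up for illustration; a real search platform would provide this through its own faceting or refinement features.

```python
from collections import defaultdict

# Hypothetical documents whose entities have already been extracted.
DOCS = {
    "doc1": {"locations": ["Vancouver", "Copenhagen"]},
    "doc2": {"locations": ["Copenhagen"]},
    "doc3": {"locations": ["Amsterdam", "Vancouver"]},
}

def build_facet_index(docs, field):
    """Map each entity value to the set of document ids that mention it."""
    index = defaultdict(set)
    for doc_id, metadata in docs.items():
        for value in metadata.get(field, []):
            index[value].add(doc_id)
    return index

index = build_facet_index(DOCS, "locations")

# Counts shown next to each filter value in the search interface.
facet_counts = {value: len(doc_ids) for value, doc_ids in index.items()}

# Documents left after the user filters on one location.
filtered = sorted(index["Copenhagen"])

print(facet_counts)
print(filtered)
```

The same index could also feed boosting or "similar results" logic, since documents that share entities are likely to be related.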


How to make an entity extractor?

The process through which the entities are identified varies depending on the needs, the problem being solved, and the chosen technical solution. While the most basic entity extractors use a dictionary (in practice, a list of names) or rules, statistical approaches can lead to better performance depending on the application. There is good reading material on what has been done in the field of NER over the past 20 years and on the current challenges and trends in this area of research [2].
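
As an illustration of the rule-based end of that spectrum, the sketch below describes two entity types with regular expressions. The labels and patterns are assumptions chosen for the example, not a complete or recommended rule set.

```python
import re

# Hypothetical rules: one regular expression per entity type.
RULES = {
    "PERCENTAGE": re.compile(r"\b\d+(?:\.\d+)?\s?(?:%|percent\b)"),
    "YEAR": re.compile(r"\b(?:19|20)\d{2}\b"),
}

def extract_by_rules(text):
    """Return (label, matched_text) pairs in the order they appear in text."""
    found = []
    for label, pattern in RULES.items():
        for match in pattern.finditer(text):
            found.append((match.start(), label, match.group(0)))
    # Sort by position so the output follows the reading order of the text.
    return [(label, value) for _, label, value in sorted(found)]

sentence = "In 2013, cycling trips rose by 12.5% and car trips fell 3 percent."
entities = extract_by_rules(sentence)
print(entities)
```

Rules like these work well for entities with a regular shape (percentages, dates, amounts), while names of people or places usually need a dictionary or a statistical model.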

Depending on the search technology you choose to work with, the entity extraction logic may already be integrated in the search platform (as it is in SharePoint 2013, where it is referred to as the custom entity extraction feature [3]), or you may build the logic yourself with the help of a tool (for example, the free OpenNLP tools used in combination with Apache Solr).


Dictionary based entity extraction

A basic entity extractor can be built by using a dictionary of names and matching the free-form text against the entries in the dictionary. In the enterprise world, the dictionary could consist of product names, department names, or domain-specific terminology (such as medical terms); a very simple example is a dictionary for identifying city names.

Take, for example, an excerpt from a news story about city cycling [4]. Using a dictionary of city names, we match the names from the dictionary against the words and phrases found in the original text. By doing this, we add new metadata to the document (locations) and can associate the news article with the locations mentioned in it. The newly identified entities can then be used to improve user experience and discoverability. Because the entities identified in this text represent locations, you could show a geographical map that pins the locations the article mentions, giving the reader a quick overview of the cities that the author thought were worth mentioning when discussing bike lane availability around the world.
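
The matching step described above can be sketched in a few lines of Python. The city list and the sentence below are invented for illustration (they are not taken from the article in [4]), and the function names are hypothetical. The sketch assumes every dictionary entry starts and ends with a letter or digit, so that word boundaries behave as expected.

```python
import re

def build_matcher(names):
    """Compile one case-insensitive pattern that matches any dictionary entry."""
    # Longer names first, so "New York City" wins over "New York".
    ordered = sorted(set(names), key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    # Word boundaries keep a name from matching inside a longer word.
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)

def extract_entities(text, matcher, canonical):
    """Return (start, end, canonical_name) for each dictionary hit in text."""
    return [(m.start(), m.end(), canonical[m.group(0).lower()])
            for m in matcher.finditer(text)]

# Hypothetical dictionary and text, for illustration only.
CITY_NAMES = ["Vancouver", "Copenhagen", "New York", "New York City", "Amsterdam"]
CANONICAL = {name.lower(): name for name in CITY_NAMES}
MATCHER = build_matcher(CITY_NAMES)

text = "Cyclists in vancouver look to Copenhagen and New York City for ideas."
hits = extract_entities(text, MATCHER, CANONICAL)

# The new "locations" metadata for the document.
locations = sorted({name for _, _, name in hits})
print(locations)
```

The character offsets in each hit are what you would use to annotate the text itself, while the deduplicated list of names is what you would store as document metadata or plot on a map.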


However, a simple question follows as soon as you decide to start working on the entity extractor: where do you get a complete list of city names? If the entity extraction targets, for example, product names or organizational structure, you will most probably have to build the dictionary yourself. However, for more general entity extractors (such as names of locations) or domain-specific entity extractors (for the medical or entertainment industry, for example), open access resources have already been built for you.


Resources for dictionary based entity extraction

You can download lists of city and country names from around the world, as well as people's names, movies, company names and other entity types, from various online resources. One place to start is the source package of the GATE project (gate.ac.uk). Download and unpack the archive and you will find many lists of entities. The following resources are also worth checking, among probably many others: dbpedia.org and freebase.com.
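
Once you have such a list, loading it into a dictionary is straightforward. The sketch below assumes a plain text file with one entry per line, optionally followed by extra fields after a separator; both the ":" separator and the sample content are assumptions, so check the actual files you download before relying on them.

```python
import io

def load_gazetteer(lines, separator=":"):
    """Read entity names from an iterable of lines, one entry per line."""
    names = set()
    for line in lines:
        # Keep only the name; anything after the separator is treated as extra data.
        name = line.split(separator, 1)[0].strip()
        if name:
            names.add(name)
    return names

# Hypothetical file content, standing in for a downloaded list.
sample = io.StringIO("Copenhagen\nVancouver:country=Canada\n\nAmsterdam\n")
cities = load_gazetteer(sample)
print(sorted(cities))
```

For a real file, you would pass an open file object instead of the in-memory sample, for example the result of open(path, encoding="utf-8") where path points to the list you downloaded.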

As for organization names, opencorporates.com stores information about more than 5 million companies from around the world. If a country is not covered by this source, another possible data source is the national company registry website. In countries such as Denmark, these websites offer, besides a search service for finding businesses, the possibility of downloading the data about them.


Should you use it?

Dictionary based entity extraction can be a quick fix for your search solution, with a high impact on the discoverability of your content and on the search experience in general. That being said, depending on the type of entities you want to identify, the domain you are working in, and your performance requirements, methods other than a dictionary might be better suited to identifying entities in your textual data. Also, entity extraction is best used in combination with other linguistic processes, such as coreference resolution, relationship extraction between entities, translation of named entities, fuzzy name matching, etc. Just do a search on Google Scholar for entity extraction and you will see how active this research area continues to be.
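
Of the complementary techniques mentioned above, fuzzy name matching is the easiest to sketch on top of a dictionary. The example below uses difflib from the Python standard library to tolerate small misspellings; the city list, the function name and the 0.8 similarity cutoff are assumptions that you would need to tune for your own data.

```python
import difflib

# Hypothetical dictionary, for illustration only.
CITY_NAMES = ["Vancouver", "Copenhagen", "Amsterdam"]

def fuzzy_lookup(token, names, cutoff=0.8):
    """Return the dictionary name closest to token, or None if nothing is close."""
    lowered = {name.lower(): name for name in names}
    close = difflib.get_close_matches(token.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[close[0]] if close else None

print(fuzzy_lookup("Copenhagn", CITY_NAMES))
print(fuzzy_lookup("Vancuver", CITY_NAMES))
print(fuzzy_lookup("bicycle", CITY_NAMES))
```

A lower cutoff catches more misspellings but also produces more false matches, so it is worth checking the results against a sample of your own content.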




REFERENCES

1. Named-entity recognition on Wikipedia
2. A survey of named entity recognition and classification
3. Create and deploy custom entity extractors in SharePoint Server 2013
4. Vancouver’s bike lanes have made it a city to watch


