Named entity normalization in user generated content

Jijkoun, V.; Khalid, M.A.; Marx, M.; de Rijke, M.

DOI: https://doi.org/10.1145/1390749.1390755

Author: V. Jijkoun, M.A. Khalid, M. Marx, M. de Rijke
Year: 2008
Title: Named entity normalization in user generated content
Journal: ACM International Conference Proceeding Series
Volume: 303
Pages (from-to): 23-30
Document type: Article
Faculty: Faculty of Science (FNWI)
Institute: Informatics Institute (IVI)
Abstract: Named entity recognition is important for semantically oriented retrieval tasks, such as question answering, entity retrieval, biomedical retrieval, trend detection, and event and entity tracking. In many of these tasks it is important to be able to accurately normalize the recognized entities, i.e., to map surface forms to unambiguous references to real world entities. Within the context of structured databases, this task (known as record linkage and data de-duplication) has been a topic of active research for more than five decades. For edited content, such as news articles, the named entity normalization (NEN) task is one that has recently attracted considerable attention. We consider the task in the challenging context of user generated content (UGC), where it forms a key ingredient of tracking and media-analysis systems.
A baseline NEN system from the literature (that normalizes surface forms to Wikipedia pages) performs considerably worse on UGC than on edited news: accuracy drops from 80% to 65% for a Dutch language data set and from 94% to 77% for English. We identify several sources of errors: entity recognition errors, multiple ways of referring to the same entity and ambiguous references.
To address these issues we propose five improvements to the baseline NEN algorithm, to arrive at a language independent NEN system that achieves overall accuracy scores of 90% on the English data set and 89% on the Dutch data set. We show that each of the improvements contributes to the overall score of our improved NEN algorithm, and conclude with an error analysis on both Dutch and English language UGC. The NEN system is computationally efficient and runs with very modest computational requirements.
Language: English
Note: Proceedings title: Proceedings of SIGIR 2008 Workshop on Analytics for Noisy Unstructured Text Data (AND 08), July 24, 2008, Singapore
Publisher: Association for Computing Machinery (ACM)
Place of publication: New York, NY
ISBN: 978-1-60558-196-5
Editors: D. Lopresti, S. Roy, K. Schulz, L.V. Subramaniam
Persistent Identifier: https://hdl.handle.net/11245/1.296644
