Named entity extraction from Portuguese web text

André Ricardo Oliveira Pires

Please use this identifier to cite or link to this item: https://hdl.handle.net/10216/106094

Author(s):	André Ricardo Oliveira Pires
Title:	Named entity extraction from Portuguese web text
Issue Date:	2017-07-07
Abstract:	In the context of Natural Language Processing, the Named Entity Recognition (NER) task focuses on extracting and classifying named entities from free text, such as news, which usually has a particular phrasing structure. Entity detection supports more complex tasks, such as Relation Extraction or Entity-Oriented Search, as in the ANT search engine. There are some NER tools focused on the Portuguese language, such as Palavras or NERP-CRF, but their F-measure is still below that obtained by other available tools, for instance models trained on an annotated English corpus with Stanford CoreNLP or OpenNLP. ANT is an entity-oriented search engine for the University of Porto (UP). This search system indexes the information available in SIGARRA, the information system of the UP. Currently it uses handcrafted selectors to extract entities, based on XPath or CSS, which depend on the structure of the page. Furthermore, these selectors do not work on free text, especially on SIGARRA's news. Using a machine learning method allows the extraction task to be automated, making it scalable and structure independent, and lowering the required effort and time. In this dissertation, I evaluate existing NER tools in order to select the best approach and configuration for the Portuguese language, particularly in the domain of SIGARRA's news. The evaluation was based on two datasets, the HAREM collection and a manually annotated subset of SIGARRA's news, which are used to assess the tools' performance using precision, recall and F-measure. Expanding the existing knowledge base will help index SIGARRA pages by providing a richer entity-oriented search experience with new information, as well as a better ranking scheme based on the additional context made available to the search engine.
The scientific community also benefits from this work, with several detailed manuals resulting from the systematic analysis of available tools, in particular for the Portuguese language. First, I carried out an out-of-the-box performance analysis of some selected tools (Stanford CoreNLP, OpenNLP, spaCy and NLTK) with the HAREM dataset, obtaining the best results for Stanford CoreNLP, followed by OpenNLP. Then, I performed a hyperparameter study in order to select the best configuration for each tool, achieving better-than-default results in each tool, particularly for NLTK's Maximum Entropy classifier, whose F-measure increased from 1.11% to 35.24%. Finally, using the best configuration, I repeated the training process with the SIGARRA News Corpus, achieving F-measures as high as 86.64% for Stanford CoreNLP. Furthermore, given that this was also the out-of-the-box winner, I conclude that Stanford CoreNLP is the best option for this particular context.
Description:	In the context of Natural Language Processing, the Named Entity Recognition (NER) task focuses on extracting and classifying named entities from free text, such as news, which usually has a particular sentence structure. Entity detection supports more complex tasks, such as Relation Extraction or Entity-Oriented Search, as in the ANT search engine. There are some NER tools focused on the Portuguese language, such as Palavras or NERP-CRF, but their F-measure is still below that obtained with other available tools, for example based on an annotated English corpus, trained with Stanford CoreNLP or OpenNLP. ANT is an entity-oriented search engine of the University of Porto (UP). This search system indexes the information available in SIGARRA, the information system of the UP. It currently uses manually built selectors to extract entities, based on XPath or CSS, which depend on the structure of the page. In addition, they do not work on free text, especially on SIGARRA's news. A method based on machine learning allows the extraction task to be automated, making it scalable and structure independent, and reducing the required work effort and time. In this dissertation, I evaluated existing NER tools in order to select the best approach and configuration for the Portuguese language, particularly in the domain of SIGARRA's news. The evaluation was based on two datasets, the HAREM collection and a manually annotated subset of SIGARRA's news, which were used to compute the tools' performance using precision, recall and F-measure.
Expanding the existing knowledge base will help index SIGARRA pages, providing a richer entity-oriented search experience with new information, as well as a better ranking scheme based on the additional context made available to the search engine. The scientific community also benefits from this work, with multiple detailed manuals resulting from the systematic analysis of the tools, in particular for the Portuguese language. First, I analyzed the baseline performance of some selected tools (Stanford CoreNLP, OpenNLP, spaCy and NLTK) with the HAREM collection, obtaining the best results with Stanford CoreNLP, followed by OpenNLP. Next, I carried out a hyperparameter study in order to select the best configuration for each tool, achieving improvements in each tool, mainly in NLTK's Maximum Entropy classifier, where the F-measure improved from 1.11% to 35.24%. Finally, using the best configuration, I repeated the training with the SIGARRA news dataset, obtaining F-measures of 86.64% for Stanford CoreNLP. Moreover, given that it was also the best in baseline performance, I conclude that Stanford CoreNLP is the best option for this context.
Subject:	Electrical engineering, Electronic engineering, Information engineering
Scientific areas:	Engineering and technology::Electrical engineering, Electronic engineering, Information engineering
TID identifier:	201795310
URI:	https://hdl.handle.net/10216/106094
Document Type:	Dissertation
Rights:	openAccess
Appears in Collections:	FEUP - Dissertação

Files in This Item:

File	Description	Size	Format
202922.pdf	Named entity extraction from Portuguese web text	2.04 MB	Adobe PDF
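
The abstract reports tool performance as precision, recall and F-measure. As a rough illustration of how these three numbers relate for NER output, the following minimal sketch scores a set of predicted entities against a gold annotation. It assumes exact-match scoring (an entity counts as correct only if both its span and its label match) and a balanced F-measure; the record does not state which scoring scheme the dissertation actually used, and the function name, entity tuples and labels below are hypothetical.

```python
from typing import Set, Tuple

# Hypothetical representation: (start_token, end_token, label).
# Exact-match scoring is an assumption, not taken from the dissertation.
Entity = Tuple[int, int, str]


def precision_recall_f1(gold: Set[Entity], predicted: Set[Entity]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for exact-match entity scoring."""
    # An entity is a true positive only if span and label both match.
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    # Balanced F-measure: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up example: four gold entities, three predictions, two of them correct.
gold = {(0, 2, "PERSON"), (5, 7, "ORGANIZATION"), (9, 10, "LOCATION"), (12, 13, "DATE")}
predicted = {(0, 2, "PERSON"), (5, 7, "LOCATION"), (9, 10, "LOCATION")}

p, r, f = precision_recall_f1(gold, predicted)
print(f"precision={p:.4f} recall={r:.4f} f1={f:.4f}")
```

In this made-up example the second prediction has the right span but the wrong label, so under exact matching it counts both as a false positive and as a missed gold entity; more lenient schemes that give partial credit would score it differently.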
