Automating the gathering of relevant information from biomedical text
Date
2009
Author
Canevet, Catherine
Abstract
Database curators increasingly rely on literature-mining techniques to help them
gather and make use of the knowledge encoded in text documents. This thesis investigates
how an assisted annotation process can help them and explores the hypothesis that it is
only with respect to full-text publications that a system can tell relevant and irrelevant
facts apart by studying their frequency.
A semi-automatic annotation process was developed for a particular database, the
Nuclear Protein Database (NPD), based on a set of full-text articles newly annotated
with regard to subnuclear protein localisation, along with eight lexicons. The annotation
process is carried out online, retrieving relevant documents (abstracts and full-text
papers) and highlighting sentences of interest in them. The process also offers a summary
table of the facts found, clustered by type of information.
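
As a rough illustration of the kind of process described above, the following minimal Python sketch highlights sentences that mention both a protein and a subnuclear compartment, then counts how often each pairing occurs. It is not the thesis's implementation: the two toy lexicons, the naive sentence splitter and the function names (split_sentences, find_terms, summarise) are all assumptions made for this example, and the eight lexicons used in the thesis are not reproduced here.

```python
import re
from collections import Counter, defaultdict

# Hypothetical toy lexicons (assumption): stand-ins for the thesis's eight lexicons.
LEXICONS = {
    "protein": {"coilin", "fibrillarin"},
    "compartment": {"nucleolus", "cajal body"},
}


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_terms(sentence):
    """Return {lexicon type: set of matched terms} for one sentence."""
    lowered = sentence.lower()
    hits = defaultdict(set)
    for kind, terms in LEXICONS.items():
        for term in terms:
            # Whole-word, case-insensitive match of the lexicon entry.
            if re.search(r"\b" + re.escape(term) + r"\b", lowered):
                hits[kind].add(term)
    return dict(hits)


def summarise(text):
    """Collect sentences of interest and count (protein, compartment) pairs."""
    highlighted = []
    counts = Counter()
    for sentence in split_sentences(text):
        hits = find_terms(sentence)
        if "protein" in hits and "compartment" in hits:
            highlighted.append(sentence)
            for protein in hits["protein"]:
                for compartment in hits["compartment"]:
                    counts[(protein, compartment)] += 1
    return highlighted, counts


if __name__ == "__main__":
    sample = (
        "Coilin localises to the Cajal body. We used standard buffers. "
        "Coilin was again detected in the Cajal body! "
        "Fibrillarin marks the nucleolus."
    )
    sentences, pair_counts = summarise(sample)
    for s in sentences:
        print("HIGHLIGHT:", s)
    for pair, n in pair_counts.most_common():
        print(pair, n)
```

In a sketch like this, the pair counts are what a frequency-based relevance filter would operate on: a pairing repeated across a full-text paper can be ranked above one mentioned only once.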
Each method involved in each step of the tool is evaluated using cross-validation
results on the training data as well as test set results. The performance of the final tool,
called the “NPD Curator System Interface”, is estimated empirically in an experiment
where the NPD curator updates the database with pieces of information found relevant
in 31 publications using the interface. A final experiment complements our main
methodology by showing its extensibility to retrieving information on protein function
rather than localisation.
I argue that the general methods, the results they produced and the discussions they
engendered are useful for any subsequent attempt to generate semi-automatic database
annotation processes. The annotated corpora, gazetteers, methods and tool are fully
available on request from the author (catherine.canevet@bbsrc.ac.uk).