Sentiment Analysis of Data from Online Forums on the Newborn Genome Sequencing

Poursepanj, Hamid

Files

hamid_poursepanj_2015_thesis.pdf (4.21 MB)

Date

2015

Authors

Poursepanj, Hamid

Publisher

Université d'Ottawa / University of Ottawa

Abstract

In this thesis, we classified user comments posted on online forums related to “Newborn Genome Sequencing” (NGS). User comments were annotated as irrelevant, positive, negative, or mixed by two annotators. The objective was to create a classification model that could predict the sentiment of each user comment with high accuracy. A baseline classifier (accuracy 52%) was created as a point of comparison. We then created a single classifier, called the flat comment-level classifier (accuracy 65.14%), to classify comments as irrelevant, positive, negative, or mixed. A more sophisticated two-level comment classifier (accuracy 69.81%) consisted of two classifiers:
 - The first classifier classified each comment as relevant or irrelevant.
 - The second classifier classified each relevant comment (as predicted by the first classifier) as positive, negative, or mixed.
Eighteen extra features were generated to improve accuracy relative to the baseline classifier (from 52% to 65.14% for flat comment classification, and from 69.48% to 69.81% for two-level comment classification). We also attempted to enhance the two-level comment classifier by using the discourse structure of each sentence in a comment. To do so, all comments were segmented into their constituent elementary discourse units (EDUs), and irrelevant EDUs were removed from the relevant comments before running the second classifier. The accuracy achieved by this enhanced two-level classifier was 64.24%; removing irrelevant EDUs therefore did not improve the accuracy.
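The two-level comment classifier described above can be pictured as a simple cascade. The following is only a minimal sketch of that control flow, not the thesis implementation: the names `two_level_predict`, `toy_is_relevant`, and `toy_sentiment_of` are hypothetical, and the keyword rules are toy stand-ins for the trained classifiers, whose actual features and learning algorithm are not given in this abstract.

```python
from typing import Callable


def two_level_predict(
    comment: str,
    is_relevant: Callable[[str], bool],
    sentiment_of: Callable[[str], str],
) -> str:
    """Run the relevance classifier first; only relevant comments reach
    the sentiment classifier."""
    if not is_relevant(comment):
        return "irrelevant"
    return sentiment_of(comment)


# Hypothetical stand-in for the first (relevant vs. irrelevant) classifier.
def toy_is_relevant(comment: str) -> bool:
    text = comment.lower()
    return "sequencing" in text or "genome" in text


# Hypothetical stand-in for the second (positive/negative/mixed) classifier.
def toy_sentiment_of(comment: str) -> str:
    text = comment.lower()
    has_positive = any(w in text for w in ("helpful", "benefit", "support"))
    has_negative = any(w in text for w in ("risk", "worry", "against"))
    if has_positive and not has_negative:
        return "positive"
    if has_negative and not has_positive:
        return "negative"
    # Both cues or neither: this toy rule falls back to "mixed".
    return "mixed"


if __name__ == "__main__":
    for text in (
        "Nice weather today",
        "Genome sequencing could benefit newborns",
        "I worry about the privacy risk of sequencing",
    ):
        print(text, "->", two_level_predict(text, toy_is_relevant, toy_sentiment_of))
```

One consequence of this design is that any comment the first stage wrongly marks as irrelevant can never be recovered by the second stage, so the overall accuracy depends on both classifiers.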
Furthermore, we performed EDU-level classification by creating two classifiers:
 - A flat EDU classifier, which classified all EDUs as irrelevant, positive, negative, or neutral.
 - A two-level EDU classifier, which first classified EDUs as relevant or irrelevant and then classified the relevant EDUs (as predicted by the first classifier) as positive, negative, or neutral.
The accuracy achieved for the flat EDU-level classifier was 81.84%. However, because the EDU dataset was highly imbalanced, the F-measure for the positive, negative, and neutral classes was very low. Under-sampling was performed to improve the F-measure for these classes.

We also investigated why forum users supported or rejected NGS. To extract the arguments, the comments were segmented into EDUs, and each EDU was annotated as relevant or irrelevant to NGS. Each relevant EDU was then annotated as for or against NGS. Topic-related EDUs were selected, as well as two EDUs before and after the topic-related EDUs. Bigrams, trigrams, four-grams, and five-grams were created from the extracted EDUs. Five-grams were more meaningful to human annotators, and were therefore favoured and ranked by frequency in the dataset. The top five five-grams were then selected as the possible arguments.
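The frequency ranking of five-grams can be illustrated with a short sketch. It rests on assumptions that the abstract does not confirm: EDUs are lowercased and split on whitespace, and n-grams are built within each EDU rather than across EDU boundaries. The function names `ngrams` and `top_ngrams` and the sample EDUs are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Tuple


def ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams of a token sequence (empty if too short)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(edus: Iterable[str], n: int = 5, k: int = 5):
    """Count n-grams over all EDUs and return the k most frequent with counts."""
    counts: Counter = Counter()
    for edu in edus:
        # Assumption: simple lowercase whitespace tokenization.
        tokens = edu.lower().split()
        counts.update(ngrams(tokens, n))
    return counts.most_common(k)


if __name__ == "__main__":
    # Invented example EDUs, for illustration only.
    sample_edus = [
        "parents have a right to know",
        "i think parents have a right to know",
        "this is too expensive",
    ]
    for gram, freq in top_ngrams(sample_edus, n=5, k=5):
        print(freq, " ".join(gram))
```

Changing `n` to 2, 3, or 4 yields the bigram, trigram, and four-gram counts in the same way; the judgment that five-grams were the most meaningful was made by human annotators and is not something this ranking decides.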

Keywords

Sentiment Analysis, Newborn Genome Sequencing

URI

http://hdl.handle.net/10393/32393
http://dx.doi.org/10.20381/ruor-4379

Collections

Thèses, 2011 - // Theses, 2011 -
