Applied NLP and ML for the detection of inappropriate text in a communications platform
Cite as:
hdl:2117/336082
Document type: Official master's thesis
Date: 2020-01-31
Access conditions: Open access
All rights reserved. This work is protected by the corresponding intellectual and
industrial property rights. Without prejudice to existing legal exemptions, its
reproduction, distribution, public communication or transformation is prohibited without the authorization of the rights holder.
Abstract
With the expansion of communication platforms, people have become accustomed
to exchanging information and communicating with others through these platforms,
for both social and business purposes. It is a known problem that many
people exploit the anonymity these platforms provide to use inappropriate and
offensive language.
The company NextreT S.L. has built a communication platform aimed at
business use. Because the company wants to keep offensive language off this platform,
natural language processing tools in a big data environment will be used
to analyze each written text and to detect and, if required, remove inappropriate
and offensive language.
In this project, a variety of state-of-the-art techniques are
analyzed, compared, and then tested on a completely new Spanish-language data set
created from the Tellfy app and a Twitter corpus.
First, different word encoding methods are tested, including word embeddings
such as Word2Vec and FastText. In addition, different hyperparameter configurations
are evaluated, as well as model performance with different data sizes. Finally,
after a forward feature selection phase, model ensemble techniques are tested.
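As an illustration of what a forward feature selection phase generally looks like, here is a minimal sketch of the greedy procedure. The abstract does not list the features or the scoring metric used in the thesis, so the feature group names (`word_ngrams`, `char_ngrams`, `embeddings`, `text_length`) and the `toy_score` function with its made-up numbers are hypothetical stand-ins; in practice the score would come from training and validating a model on each candidate feature set.

```python
from typing import Callable, FrozenSet, List, Sequence


def forward_select(
    candidates: Sequence[str],
    score: Callable[[FrozenSet[str]], float],
    min_gain: float = 0.0,
) -> List[str]:
    """Greedy forward selection.

    Start from an empty set and repeatedly add the single candidate that
    improves the score the most; stop when no candidate improves it by
    more than `min_gain`.
    """
    selected: List[str] = []
    remaining = list(candidates)
    best = score(frozenset())
    while remaining:
        # Score the current selection extended by each remaining candidate.
        trials = [(score(frozenset(selected + [c])), c) for c in remaining]
        trial_score, winner = max(trials, key=lambda t: t[0])
        if trial_score - best <= min_gain:
            break  # no remaining candidate helps enough
        selected.append(winner)
        remaining.remove(winner)
        best = trial_score
    return selected


# Hypothetical validation scores in integer points, invented for this sketch.
TOY_GAINS = {"word_ngrams": 20, "char_ngrams": 10, "embeddings": 15, "text_length": 0}


def toy_score(features: FrozenSet[str]) -> float:
    # Baseline of 50 points plus each group's gain, with a redundancy
    # penalty when both n-gram groups are used together.
    total = 50 + sum(TOY_GAINS[f] for f in features)
    if {"word_ngrams", "char_ngrams"} <= features:
        total -= 12
    return total


chosen = forward_select(list(TOY_GAINS), toy_score)
print(chosen)
```

With these toy numbers the procedure keeps `word_ngrams` and `embeddings` and rejects the other two groups, because adding either of them no longer raises the score. The greedy search evaluates far fewer subsets than an exhaustive search, at the cost of possibly missing a better combination.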
These tests show that the combination of features
used is very important for increasing model performance. The
choice of word representation technique is also closely tied to the performance of
the models. Furthermore, the training sets need to be
as representative and as large as possible. Finally, although several complex
Deep Neural Network models were tried, more traditional Logistic Regression models can
offer better performance.
Subjects: Machine learning, Natural language processing (Computer science)
Degree: Master's Degree in Artificial Intelligence (2017 curriculum)
Files | Size |
---|---|
148095.pdf | 722.2 KB |