Applied NLP and ML for the detection of inappropriate text in a communications platform

Urrutia Zubikarai, Aitor

Tutor / director: Alquezar Mancho, René; Riera Molina, Jordi

Document type: Official Master's Final Project

Date: 2020-01-31

Access conditions: Open access

All rights reserved. This work is protected by the corresponding intellectual and industrial property rights. Without prejudice to existing legal exemptions, its reproduction, distribution, public communication or transformation is prohibited without the authorization of the rights holder.

Abstract

With the expansion of communication platforms, people have become accustomed to exchanging information and communicating with others through them, for both social and business purposes. It is a known problem that many people exploit the anonymity these platforms provide to use inappropriate and offensive language. The company NextreT S.L. has built a communication platform aimed at business use. Because the company wants to avoid offensive language on this platform, natural language processing tools in a big data environment are used to analyze each written text and to detect and, if required, remove inappropriate and offensive language. In this project, a variety of state-of-the-art techniques are analyzed, compared and then tested on a completely new Spanish-language data set created from the Tellfy App and a Twitter corpus. Initially, different word encoding methods are tested, including word embeddings such as Word2Vec and FastText. In addition, different hyperparameter configurations are checked, as well as model performance with different data sizes. Finally, after a forward feature selection phase, model ensemble techniques are tested. These tests show that the combination of features used is very important for increasing model performance, and that the choice of word representation technique is closely related to that performance. Furthermore, the training sets need to be as representative and as large as possible. Finally, although different complex Deep Neural Network models were tried, more traditional Logistic Regression models can offer better performance.

Subjects: Machine learning, Natural language processing (Computer science)

Degree: Master's Degree in Artificial Intelligence (2017 curriculum)

URI: http://hdl.handle.net/2117/336082

Collections

Official master's degrees - Master in Artificial Intelligence - MAI [280]

Files	Description	Size	Format
148095.pdf		722.2 Kb	PDF

UPCommons. UPC's open knowledge portal
