Simple record

dc.contributor.author: Velez de Mendizabal Gonzalez, Iñaki
dc.contributor.author: Ezpeleta Gallastegi, Enaitz
dc.contributor.author: Zurutuza Ortega, Urko
dc.contributor.other: Basto-Fernandes, Vitor
dc.contributor.other: Méndez, José R.
dc.date.accessioned: 2020-06-19T10:11:08Z
dc.date.available: 2020-06-19T10:11:08Z
dc.date.issued: 2020
dc.identifier.issn: 0306-4573 [en]
dc.identifier.other: https://katalogoa.mondragon.edu/janium-bin/janium_login_opac.pl?find&ficha_no=159059 [en]
dc.identifier.uri: http://hdl.handle.net/20.500.11984/1693
dc.description.abstract: In recent years, most content-based spam filters have been implemented using Machine Learning (ML) approaches by means of token-based representations of textual contents. After introducing multiple performance enhancements, the impact has been virtually irrelevant. Recent studies have introduced synset-based content representations as a reliable way to improve classification, as well as different forms to take advantage of semantic information to address problems, such as dimensionality reduction. These preliminary solutions present some limitations and enforce simplifications that must be gradually redefined in order to obtain significant improvements in spam content filtering. This study addresses the problem of feature reduction by introducing a new semantic-based proposal (SDRS) that avoids losing knowledge (lossless). Synset-features can be semantically grouped by taking advantage of taxonomic relations (mainly hypernyms) provided by BabelNet ontological dictionary (e.g. “Viagra” and “Cialis” can be summarized into the single features “anti-impotence drug”, “drug” or “chemical substance” depending on the generalization of 1, 2 or 3 levels). In order to decide how many levels should be used to generalize each synset of a dataset, our proposal takes advantage of Multi-Objective Evolutionary Algorithms (MOEA) and particularly, of the Non-dominated Sorting Genetic Algorithm (NSGA-II). We have compared the performance achieved by a Naïve Bayes classifier, using both token-based and synset-based dataset representations, with and without executing dimensional reductions. As a result, our lossless semantic reduction strategy was able to find optimal semantic-based feature grouping strategies for the input texts, leading to a better performance of Naïve Bayes classifiers. [en]
dc.description.sponsorship: Gobierno de España [es]
dc.description.sponsorship: Gobierno de Portugal [es]
dc.language.iso: eng [en]
dc.publisher: Elsevier Ltd. [en]
dc.rights: © 2020 Elsevier Ltd. [en]
dc.subject: Spam filtering [en]
dc.subject: Token-based representation [en]
dc.subject: Synset-based representation [en]
dc.subject: Semantic-based feature reduction [en]
dc.subject: Multi-objective evolutionary algorithms [en]
dc.title: SDRS: A new lossless dimensionality reduction for text corpora [en]
dc.type: info:eu-repo/semantics/article [en]
dcterms.accessRights: info:eu-repo/semantics/embargoedAccess [en]
dcterms.source: Information Processing & Management [en]
dc.description.version: info:eu-repo/semantics/acceptedVersion [en]
local.contributor.group: Análisis de datos y ciberseguridad [es]
local.description.peerreviewed: true [en]
local.identifier.doi: https://doi.org/10.1016/j.ipm.2020.102249 [en]
local.relation.projectID: GE/Programa Estatal de Investigacion, Desarrollo e Innovación orientada a los retos de la sociedad en el marco del Plan Estatal de Investigación Científica y Técnica y de Innovación 2013-2016, convocatoria del 2017/TIN2017-84658-C2-2-R/Integración de Conocimiento Semántico para el Filtrado de Spam basado en Contenido/SKI4SPAM [en]
local.relation.projectID: Fundação para a Ciência e a Tecnologia/ UIDB/04466/2020 and UIDP/04466/2020. [en]
local.embargo.enddate: 2022-07-01
local.contributor.otherinstitution: Instituto Universitário de Lisboa (Iscte) [es]
local.contributor.otherinstitution: Universidade de Vigo [es]
local.contributor.otherinstitution: Instituto de Investigación Sanitaria Galicia Sur (IISGS) [es]
local.source.details: Vol. 57. N. 4. n. artículo 102249, [eu_ES]
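
The hypernym-based grouping described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `HYPERNYM` dictionary is a hypothetical stand-in for BabelNet lookups, built only from the "Viagra"/"Cialis" example in the abstract, and the function names `generalize` and `reduce_features` are assumptions made for illustration. The per-feature generalization levels are fixed by hand here, whereas the paper reports selecting them with NSGA-II.

```python
from collections import Counter

# Hypothetical toy taxonomy: each entry maps a feature to its direct hypernym.
# A real pipeline would query BabelNet; this dict is an illustrative stand-in
# built from the example given in the abstract.
HYPERNYM = {
    "Viagra": "anti-impotence drug",
    "Cialis": "anti-impotence drug",
    "anti-impotence drug": "drug",
    "drug": "chemical substance",
}


def generalize(feature, levels):
    """Walk up at most `levels` hypernym steps, stopping at the top of the chain."""
    for _ in range(levels):
        parent = HYPERNYM.get(feature)
        if parent is None:
            break
        feature = parent
    return feature


def reduce_features(counts, levels_by_feature):
    """Merge feature counts after generalizing each feature by its own level.

    Counts of features that map to the same hypernym are summed, so no
    occurrence is discarded; features without an assigned level are kept as-is.
    """
    reduced = Counter()
    for feature, count in counts.items():
        target = generalize(feature, levels_by_feature.get(feature, 0))
        reduced[target] += count
    return dict(reduced)


# One message represented as feature counts (made-up values for illustration).
message = {"Viagra": 2, "Cialis": 1, "price": 3}
# Hand-picked levels; in the paper this assignment is what the MOEA searches for.
levels = {"Viagra": 1, "Cialis": 1}
reduced = reduce_features(message, levels)
print(reduced)
```

With one level of generalization the two drug names collapse into a single "anti-impotence drug" feature whose count is the sum of both, which shrinks the feature space from three dimensions to two in this toy message.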

