Spam filters: Bayes vs. Chi-Squared; letters vs. words
Citation:
Cormac O'Brien and Carl Vogel, 'Spam filters: Bayes vs. Chi-Squared; letters vs. Words', in International Symposium on Information and Communication Technologies, Dublin, eds. Markus Aleksy, et al., 2003, pp. 298-303
Download Item:
Spam Filters.pdf (published (author copy) peer-reviewed) 69.74Kb
Abstract:
We compare two statistical methods for identifying spam or junk electronic mail. Spam
filters are classifiers which determine whether an email is junk or not. The proliferation of
spam email has made electronic filtering vitally important. The magnitude of the problem
is discussed. We examine the Naive Bayesian method in relation to the 'Chi by Degrees of
Freedom' approach, the latter used in the field of authorship identification. Both methods
produce very promising results. However, 'Chi by Degrees of Freedom' has the advantage
of providing significance measures, which will help to reduce false positives. Statistics based
on character-level tokenization prove more effective than those based on word-level tokenization.
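As a rough illustration of the comparison the abstract describes, the sketch below computes a chi-squared statistic divided by its degrees of freedom between a message and two reference texts, using either character-level or word-level tokens. This is a simplified reading of the general idea, not the authors' procedure: the function names (char_tokens, word_tokens, chi_by_df, classify), the use of the full shared vocabulary, and the "lower score wins" decision rule are all assumptions made for illustration.

```python
from collections import Counter


def char_tokens(text, n=1):
    """Character-level tokens: overlapping n-grams (single letters by default)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_tokens(text):
    """Word-level tokens: lowercase, whitespace-separated."""
    return text.lower().split()


def chi_by_df(tokens_a, tokens_b):
    """Chi-squared over a 2 x k token-frequency table, divided by k - 1.

    Assumption: k is the size of the combined vocabulary; a published
    method may instead restrict the table to the most frequent tokens.
    """
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
    if n_a == 0 or n_b == 0:
        raise ValueError("both token lists must be non-empty")
    vocab = set(counts_a) | set(counts_b)
    chi2 = 0.0
    for tok in vocab:
        total = counts_a[tok] + counts_b[tok]
        # Expected counts if both samples shared one token distribution.
        exp_a = total * n_a / (n_a + n_b)
        exp_b = total * n_b / (n_a + n_b)
        chi2 += (counts_a[tok] - exp_a) ** 2 / exp_a
        chi2 += (counts_b[tok] - exp_b) ** 2 / exp_b
    # Guard against a one-token vocabulary, where k - 1 would be zero.
    return chi2 / max(len(vocab) - 1, 1)


def classify(message, spam_text, ham_text, tokenize=char_tokens):
    """Label a message by whichever reference text it diverges from less."""
    tokens = tokenize(message)
    spam_score = chi_by_df(tokens, tokenize(spam_text))
    ham_score = chi_by_df(tokens, tokenize(ham_text))
    return "spam" if spam_score < ham_score else "ham"
```

Passing tokenize=word_tokens instead of the default switches the same comparison from letters to words, which is the contrast the abstract draws.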
Author's Homepage:
http://people.tcd.ie/vogel
Description:
PUBLISHED
Author: VOGEL, CARL
Publisher:
ACM
Type of material:
Conference Paper
Availability:
Full text available
Keywords:
Computer Science