
JMLR
2006

Spam Filtering Using Statistical Data Compression Models

Spam filtering poses a special problem in text categorization: its defining characteristic is that filters face an active adversary who constantly attempts to evade filtering. Since spam evolves continuously and most practical applications are based on online user feedback, the task calls for fast, incremental and robust learning algorithms. In this paper, we investigate a novel approach to spam filtering based on adaptive statistical data compression models. The nature of these models allows them to be employed as probabilistic text classifiers based on character-level or binary sequences. Because messages are modeled as sequences, tokenization and other error-prone preprocessing steps are omitted altogether, resulting in a method that is very robust. The models are also fast to construct and incrementally updatable. We evaluate the filtering performance of two different compression algorithms: dynamic Markov compression and prediction by partial matching. The results of our emp...
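The general idea in the abstract, classifying a message by which class model would compress it best, can be illustrated with a minimal sketch. This is not the paper's method: a fixed-order character Markov model with add-one smoothing stands in for dynamic Markov compression and prediction by partial matching, and the names CharMarkovModel, code_length and classify, the order of 2, and the alphabet size of 256 are all assumptions made for illustration.

```python
import math
from collections import defaultdict


class CharMarkovModel:
    """Order-k character model with add-one smoothing.

    A simplified, hypothetical stand-in for an adaptive compression
    model such as PPM or DMC; it is not either of those algorithms.
    """

    def __init__(self, order=2, alphabet_size=256):
        self.order = order
        self.alphabet_size = alphabet_size  # assumed symbol count for smoothing
        self.counts = defaultdict(lambda: defaultdict(int))
        self.totals = defaultdict(int)

    def update(self, text):
        """Incrementally add one message to the model."""
        padded = "\0" * self.order + text
        for i in range(self.order, len(padded)):
            ctx = padded[i - self.order:i]
            self.counts[ctx][padded[i]] += 1
            self.totals[ctx] += 1

    def code_length(self, text):
        """Estimated bits needed to encode text; the model is not modified."""
        padded = "\0" * self.order + text
        bits = 0.0
        for i in range(self.order, len(padded)):
            ctx = padded[i - self.order:i]
            count = self.counts.get(ctx, {}).get(padded[i], 0)
            total = self.totals.get(ctx, 0)
            prob = (count + 1) / (total + self.alphabet_size)
            bits -= math.log2(prob)
        return bits


def classify(message, spam_model, ham_model):
    """Pick the class whose model gives the shorter code length."""
    if spam_model.code_length(message) < ham_model.code_length(message):
        return "spam"
    return "ham"


# Toy training data, invented for this example only.
spam_model = CharMarkovModel()
ham_model = CharMarkovModel()
for msg in ["buy cheap pills now", "cheap pills cheap offer"]:
    spam_model.update(msg)
for msg in ["meeting at noon tomorrow", "see you at the meeting"]:
    ham_model.update(msg)

label_a = classify("cheap pills", spam_model, ham_model)
label_b = classify("the meeting", spam_model, ham_model)
```

Because the sketch works directly on characters, no tokenizer is involved, and calling update with each newly labeled message mirrors the incremental, feedback-driven setting the abstract describes.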
Added 13 Dec 2010
Updated 13 Dec 2010
Type Journal
Year 2006
Where JMLR
Authors Andrej Bratko, Gordon V. Cormack, Bogdan Filipic, Thomas R. Lynam, Blaz Zupan