
DIS
2007
Springer

Unsupervised Spam Detection Based on String Alienness Measures

We propose an unsupervised method for detecting spam documents in Web page data, based on equivalence relations on strings. We propose three measures for quantifying the alienness (i.e., how different a class is from the others) of substring equivalence classes within a given set of strings. A document is then classified as spam if it contains a characteristic equivalence class as a substring. The proposed method is unsupervised, language-independent, and very efficient. Computational experiments on data collected from Japanese web forums show fairly good results.
Categories and Subject Descriptors: H.3.3 [Information Search and Retrieval]: Retrieval models; I.5.4 [Applications]: Text processing
General Terms: Algorithm, Experimentation, Performance
Keywords: Spam Detection, Equivalence Class
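The abstract does not specify the equivalence relation or the three alienness measures, so the following is only a toy sketch of the general idea, not the authors' method. It assumes a stand-in relation (two substrings are equivalent when they end at exactly the same positions in the collection) and a hypothetical score, `toy_alienness`, that favors long classes shared by several documents. The names `endpos_classes`, `toy_alienness`, `flag_spam`, and the parameters `threshold` and `max_len` are all illustrative.

```python
from collections import defaultdict


def endpos_classes(docs, max_len=20):
    """Group substrings (up to max_len chars) by their set of end positions.

    Assumption: this end-position equivalence is a stand-in; the abstract
    does not say which equivalence relation the paper uses.
    """
    occurrences = defaultdict(set)
    for d, text in enumerate(docs):
        for i in range(len(text)):
            for j in range(i + 1, min(len(text), i + max_len) + 1):
                # Record the document index and end offset of this occurrence.
                occurrences[text[i:j]].add((d, j))
    classes = defaultdict(list)
    for substring, positions in occurrences.items():
        # Substrings with identical occurrence sets fall into one class.
        classes[frozenset(positions)].append(substring)
    return classes


def toy_alienness(members, positions):
    """Hypothetical score, NOT one of the paper's three measures.

    Longest member length times the number of distinct documents,
    or 0 when the class occurs in a single document only.
    """
    n_docs = len({d for d, _ in positions})
    if n_docs < 2:
        return 0
    return max(len(m) for m in members) * n_docs


def flag_spam(docs, threshold, max_len=20):
    """Return sorted indices of documents containing a high-scoring class."""
    flagged = set()
    for positions, members in endpos_classes(docs, max_len).items():
        if toy_alienness(members, positions) >= threshold:
            flagged.update(d for d, _ in positions)
    return sorted(flagged)


docs = [
    "hello there, nice weather today",
    "buy cheap pills now at example",
    "lovely post! buy cheap pills now",
    "I agree with the above",
]
spam_indices = flag_spam(docs, threshold=20)
```

On this small example, the phrase repeated verbatim in the second and third documents forms a long class shared by two documents, so those two are the ones flagged. The brute-force enumeration is quadratic in document length and is meant only for illustration; an efficient implementation would rely on a suffix-based index instead.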
Added 07 Jun 2010
Updated 07 Jun 2010
Type Conference
Year 2007
Where DIS
Authors Kazuyuki Narisawa, Hideo Bannai, Kohei Hatano, Masayuki Takeda