Originally conceived as a "naive" baseline experiment using traditional n-gram language models as classifiers, the NCLEANER system has turned out to be a fast and lightw...
The detection of duplicate tuples, corresponding to the same real-world entity, is an important task in data integration and cleaning. While many techniques exist to identify such...
The Prediction by Partial Matching (PPM) algorithm uses a cumulative frequency count of input symbols in different contexts to estimate their probability distribution. Excellent c...
The widespread use of email has raised serious privacy concerns. A critical issue is how to prevent email information leaks, i.e., when a message is accidentally addressed to non-...
Half the world’s population is expected to live in urban areas by 2020. The high human density and changes in peoples’ consumption habits result in an everincreasing amount of...
Inbal Reif, Florian Alt, Juan David Hincapié...