Probabilistic Data Generation for Deduplication and Data Linkage



Abstract. In many data mining projects the data to be analysed contains personal information, like names and addresses. Cleaning and preprocessing of such data likely involves deduplication or linkage with other data, which is often challenged by a lack of unique entity identifiers. In recent years there has been an increased research effort in data linkage and deduplication, mainly in the machine learning and database communities. Publicly available test data with known deduplication or linkage status is needed so that new linkage algorithms and techniques can be tested, evaluated and compared. However, publication of data containing personal information is normally impossible due to privacy and confidentiality issues. An alternative is to use artificially created data, which has the advantages that content and error rates can be controlled, and the deduplication or linkage status is known. Controlled experiments can be performed and replicated easily. In this paper we present a f...
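The idea of artificial data with controlled error rates and known duplicate status can be illustrated with a minimal Python sketch. This is not the generator presented in the paper; the function names (`corrupt`, `generate`), the field names, the `dup_rate` and `error_rate` parameters, and the simple character-level error model are all assumptions made for illustration.

```python
import random


def corrupt(value, rng):
    """Apply one random character-level edit (hypothetical error model).

    The result may occasionally equal the input, e.g. when two equal
    neighbouring characters are swapped.
    """
    if len(value) < 2:
        return value
    i = rng.randrange(len(value) - 1)
    op = rng.choice(["delete", "swap", "substitute"])
    if op == "delete":
        return value[:i] + value[i + 1:]
    if op == "swap":
        return value[:i] + value[i + 1] + value[i] + value[i + 2:]
    return value[:i] + rng.choice("abcdefghijklmnopqrstuvwxyz") + value[i + 1:]


def generate(originals, dup_rate, error_rate, seed=0):
    """Return original records plus corrupted duplicates.

    Every record carries an 'entity' label, so the true duplicate
    status is known. dup_rate is the probability that an original gets
    one duplicate; error_rate is the per-field corruption probability.
    """
    rng = random.Random(seed)  # fixed seed makes the experiment repeatable
    records = []
    for n, rec in enumerate(originals):
        records.append({"rec_id": f"org-{n}", "entity": n, **rec})
        if rng.random() < dup_rate:
            dup = {
                k: (corrupt(v, rng) if rng.random() < error_rate else v)
                for k, v in rec.items()
            }
            records.append({"rec_id": f"dup-{n}", "entity": n, **dup})
    return records


if __name__ == "__main__":
    people = [
        {"given_name": "peter", "surname": "smith"},
        {"given_name": "anna", "surname": "jones"},
    ]
    for row in generate(people, dup_rate=0.5, error_rate=0.3, seed=42):
        print(row)
```

Because the `entity` label travels with each record, the output of any linkage algorithm run on this data can be scored against the known truth, and fixing the seed lets the same data set be regenerated for a repeated experiment.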

Peter Christen
