DduP - Towards a Deduplication Framework Utilising Apache Spark

This paper presents a new framework called DeduPlication (DduP). DduP aims to solve large-scale deduplication problems on arbitrary data tuples and to bridge the gap between big data, high performance, and duplicate detection. A first prototype exists, but the overall project is still a work in progress. DduP builds on the Apache Spark framework [ZCF+10], the promising successor of Apache Hadoop MapReduce [Had14], and on its modules MLlib [MLl14] and GraphX [XCD+14]. The project has three main goals: creating a prototype of the DduP framework, analysing the scalability and performance of the deduplication process, and evaluating the behaviour of different small cluster configurations.
Tags: Duplicate Detection, Deduplication, Record Linkage, Machine Learning, Big Data, Apache Spark, MLlib, Scala, Hadoop, In-Memory
Niklas Wilcke
Added 17 Apr 2016
Updated 17 Apr 2016
Type Journal
Year 2015
Where BTW
Authors Niklas Wilcke