Pervasive parallelism in data mining: dataflow solution to co-clustering large and sparse Netflix data


Download www.pervasivedatarush.com

All Netflix Prize algorithms proposed so far are prohibitively costly for large-scale production systems. In this paper, we describe an efficient dataflow implementation of a collaborative filtering (CF) solution to the Netflix Prize problem [1] based on weighted co-clustering [5]. The dataflow library we use facilitates the development of sophisticated parallel programs designed to fully utilize commodity multicore hardware, while hiding traditional difficulties such as queuing, threading, memory management, and deadlocks. The dataflow CF implementation first compresses the large, sparse training dataset into co-clusters. Then it generates recommendations by combining the average ratings of the co-clusters with the biases of the users and movies. When configured to identify 20x20 co-clusters in the Netflix training dataset, the implementation predicted over 100 million ratings in 16.31 minutes and achieved an RMSE of 0.88846 without any fine-tuning or domain knowledge. This is an eff...
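The prediction step described in the abstract (co-cluster average combined with user and movie biases) can be illustrated with a minimal Python sketch. It is not the authors' dataflow implementation. It assumes the cluster assignments are already known, and it assumes each bias is the user's (or movie's) mean rating minus the mean rating of its cluster, which is one common form of the co-clustering predictor. The function names, the toy ratings, and the cluster assignments are all hypothetical.

```python
from collections import defaultdict
from math import sqrt


def mean_by(ratings, key):
    """Average rating of the (user, movie, rating) triples, grouped by key(user, movie)."""
    total, count = defaultdict(float), defaultdict(int)
    for user, movie, rating in ratings:
        k = key(user, movie)
        total[k] += rating
        count[k] += 1
    return {k: total[k] / count[k] for k in total}


def fit(ratings, user_cluster, movie_cluster):
    """Compress the sparse ratings into the averages the predictor needs."""
    return {
        "cocluster": mean_by(ratings, lambda u, m: (user_cluster[u], movie_cluster[m])),
        "user": mean_by(ratings, lambda u, m: u),
        "movie": mean_by(ratings, lambda u, m: m),
        "user_cluster": mean_by(ratings, lambda u, m: user_cluster[u]),
        "movie_cluster": mean_by(ratings, lambda u, m: movie_cluster[m]),
    }


def predict(model, user_cluster, movie_cluster, user, movie):
    """Co-cluster average plus user bias plus movie bias (assumed bias definition)."""
    g, h = user_cluster[user], movie_cluster[movie]
    user_bias = model["user"][user] - model["user_cluster"][g]
    movie_bias = model["movie"][movie] - model["movie_cluster"][h]
    return model["cocluster"][(g, h)] + user_bias + movie_bias


def rmse(pairs):
    """Root mean squared error over (predicted, actual) pairs."""
    pairs = list(pairs)
    return sqrt(sum((p - a) ** 2 for p, a in pairs) / len(pairs))


# Hypothetical toy data: (user, movie, rating) triples and fixed cluster assignments.
ratings = [("u1", "m1", 5), ("u1", "m2", 4), ("u2", "m1", 4), ("u3", "m3", 2), ("u3", "m1", 1)]
user_cluster = {"u1": 0, "u2": 0, "u3": 1}
movie_cluster = {"m1": 0, "m2": 0, "m3": 1}

model = fit(ratings, user_cluster, movie_cluster)
prediction = predict(model, user_cluster, movie_cluster, "u2", "m2")
```

In this sketch the model stores only per-user, per-movie, and per-cluster averages, which is why the training data can be described as compressed into co-clusters. A full implementation would also learn the cluster assignments and handle users, movies, or co-clusters with no training ratings, both of which are omitted here.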

Srivatsava Daruru, Nena M. Marin, Matt Walker, Joydeep Ghosh


Data Mining | Dataflow CF Implementation | Dataflow Library | Efficient Dataflow Implementation | KDD 2009


Post Info

Added	25 Nov 2009
Updated	25 Nov 2009
Type	Conference
Year	2009
Where	KDD
Authors	Srivatsava Daruru, Nena M. Marin, Matt Walker, Joydeep Ghosh
