EDBT
2010
ACM

Optimizing joins in a map-reduce environment

Implementations of map-reduce are being used to perform many operations on very large data. We examine strategies for joining several relations in the map-reduce environment. Our new approach begins by identifying the “map-key,” the set of attributes that identifies the Reduce process to which a Map process must send a particular tuple. Each attribute of the map-key gets a “share,” which is the number of buckets into which its values are hashed, to form a component of the identifier of a Reduce process. Relations have their tuples replicated in limited fashion, the degree of replication depending on the shares for those map-key attributes that are missing from their schema. We study the problem of optimizing the shares, given a fixed number of Reduce processes. We give an algorithm for detecting and fixing problems where a variable is mistakenly included in the map-key. Then, we consider two important special cases: chain joins and star joins. In each case we are able to det...
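
The routing idea in the abstract can be illustrated with a minimal sketch. It assumes a three-relation join R(A,B), S(B,C), T(C,D) with map-key {B, C}. The relation names, the `bucket` hash function, the `reducer_ids` helper, and the share values are all illustrative assumptions, not details taken from the abstract. Each Reduce process is identified by one bucket number per map-key attribute, and a tuple that lacks a map-key attribute is replicated across every bucket of that attribute.

```python
from itertools import product


def bucket(value, share):
    # Hypothetical hash function: built-in hash reduced modulo the share.
    # Integer values are used in this sketch so the result is deterministic.
    return hash(value) % share


def reducer_ids(tuple_attrs, shares):
    """Return every Reduce-process identifier that must receive one tuple.

    tuple_attrs: dict mapping attribute name -> value for the tuple's schema.
    shares: dict mapping each map-key attribute -> its share (bucket count).
    """
    choices = []
    for attr, share in shares.items():
        if attr in tuple_attrs:
            # Attribute present: its hashed value fixes this component.
            choices.append([bucket(tuple_attrs[attr], share)])
        else:
            # Attribute missing from the schema: replicate over all buckets.
            choices.append(range(share))
    return list(product(*choices))


# Assumed shares: 3 buckets for B, 4 for C, so 3 * 4 = 12 Reduce processes.
shares = {"B": 3, "C": 4}

r_tuple = {"A": 1, "B": 7}    # R(A,B) lacks C: replicated 4 times
s_tuple = {"B": 7, "C": 10}   # S(B,C) has both: sent to exactly one reducer
t_tuple = {"C": 10, "D": 5}   # T(C,D) lacks B: replicated 3 times

for name, tup in [("R", r_tuple), ("S", s_tuple), ("T", t_tuple)]:
    print(name, reducer_ids(tup, shares))
```

In this sketch the replication factor of a relation is the product of the shares of the map-key attributes missing from its schema, which is why the choice of shares for a fixed number of Reduce processes affects the total communication cost.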
Foto N. Afrati, Jeffrey D. Ullman
Added 18 May 2010
Updated 18 May 2010
Type Conference
Year 2010
Where EDBT
Authors Foto N. Afrati, Jeffrey D. Ullman