Optimizing joins in a map-reduce environment

13 years 11 months ago


Implementations of map-reduce are being used to perform many operations on very large data. We examine strategies for joining several relations in the map-reduce environment. Our new approach begins by identifying the “map-key,” the set of attributes that identify the Reduce process to which a Map process must send a particular tuple. Each attribute of the map-key gets a “share,” which is the number of buckets into which its values are hashed, to form a component of the identifier of a Reduce process. Relations have their tuples replicated in limited fashion, the degree of replication depending on the shares for those map-key attributes that are missing from their schema. We study the problem of optimizing the shares, given a fixed number of Reduce processes. An algorithm for detecting and fixing problems where a variable is mistakenly included in the map-key is given. Then, we consider two important special cases: chain joins and star joins. In each case we are able to det...
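The routing idea in the abstract can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper's text: the three-way join R(A,B) ⋈ S(B,C) ⋈ T(C,D), the map-key {B, C}, the share values 2 and 3, the toy modulo hash, and the names `SHARES`, `bucket`, and `reducers_for`.

```python
from itertools import product

# Hypothetical setup: join R(A,B), S(B,C), T(C,D) with map-key {B, C}.
# Shares are illustrative; the number of Reduce processes is 2 * 3 = 6.
SHARES = {"B": 2, "C": 3}
MAP_KEY = tuple(SHARES)  # fixed attribute order for Reduce-process identifiers


def bucket(attr, value):
    """Toy hash (assumption): map an integer value to one of SHARES[attr] buckets."""
    return value % SHARES[attr]


def reducers_for(schema, tup):
    """Return the identifiers of all Reduce processes that must receive tup.

    A map-key attribute present in the schema fixes one bucket; a missing
    attribute forces replication across all of that attribute's buckets.
    """
    row = dict(zip(schema, tup))
    choices = []
    for attr in MAP_KEY:
        if attr in row:
            choices.append([bucket(attr, row[attr])])
        else:
            choices.append(range(SHARES[attr]))
    return list(product(*choices))


r_targets = reducers_for(("A", "B"), (7, 5))  # missing C: one copy per C bucket
s_targets = reducers_for(("B", "C"), (5, 4))  # has B and C: exactly one copy
t_targets = reducers_for(("C", "D"), (4, 9))  # missing B: one copy per B bucket
```

In this sketch each R tuple is sent to 3 Reduce processes (the share of C), each T tuple to 2 (the share of B), and each S tuple to exactly one. The three example tuples agree on B = 5 and C = 4, so they all reach the same Reduce process, where they can be joined. Choosing the shares to minimize total replication for a fixed number of Reduce processes is the optimization problem the abstract describes; this sketch does not attempt it.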

Foto N. Afrati, Jeffrey D. Ullman
