CoHadoop: Flexible Data Placement and Its Exploitation in Hadoop



Hadoop has become an attractive platform for large-scale data analytics. In this paper, we identify a major performance bottleneck of Hadoop: its lack of ability to colocate related data on the same set of nodes. To overcome this bottleneck, we introduce CoHadoop, a lightweight extension of Hadoop that allows applications to control where data are stored. In contrast to previous approaches, CoHadoop retains the flexibility of Hadoop in that it does not require users to convert their data to a certain format (e.g., a relational database or a specific file format). Instead, applications give hints to CoHadoop that some set of files are related and may be processed jointly; CoHadoop then tries to colocate these files for improved efficiency. Our approach is designed such that the strong fault tolerance properties of Hadoop are retained. Colocation can be used to improve the efficiency of many operations, including indexing, grouping, aggregation, columnar storage, joins, and sessi...
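
The hint-based colocation idea described in the abstract can be illustrated with a toy model. The sketch below is not CoHadoop's implementation or API: the class `ToyPlacer`, its methods, and the string-valued `hint` are hypothetical names chosen for illustration, and the policy (reuse the node set of the first file that carried the same hint) is an assumption about one simple way such hints could be honored. Each file still receives a full set of distinct replicas, which loosely mirrors the abstract's point that replication-based fault tolerance is kept.

```python
import random


class ToyPlacer:
    """Toy replica placement with optional colocation hints.

    Hypothetical illustration only; not the CoHadoop implementation.
    """

    def __init__(self, nodes, replication=3, seed=0):
        self.nodes = list(nodes)
        self.replication = replication
        self.rng = random.Random(seed)
        self.hint_to_nodes = {}  # hint -> node set picked for the first file with that hint
        self.placement = {}      # file name -> tuple of nodes holding its replicas

    def place(self, filename, hint=None):
        """Pick replica nodes for a file; files sharing a hint share a node set."""
        if hint is not None and hint in self.hint_to_nodes:
            # A related file was placed earlier: reuse its nodes.
            chosen = self.hint_to_nodes[hint]
        else:
            # No usable hint: fall back to random placement on distinct nodes.
            chosen = tuple(sorted(self.rng.sample(self.nodes, self.replication)))
            if hint is not None:
                self.hint_to_nodes[hint] = chosen
        self.placement[filename] = chosen
        return chosen

    def colocated(self, file_a, file_b):
        """Return True if both files are stored on the same set of nodes."""
        return self.placement[file_a] == self.placement[file_b]


placer = ToyPlacer([f"node{i}" for i in range(10)])
placer.place("logs/part-0", hint="partition-0")
placer.place("accounts/part-0", hint="partition-0")
print(placer.colocated("logs/part-0", "accounts/part-0"))
```

With both files on the same nodes, an operation that processes them jointly (a join over matching partitions, for example) could read both inputs locally instead of fetching one over the network.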

Mohamed Y. Eltabakh, Yuanyuan Tian, Fatma Özcan, Rainer Gemulla, Aljoscha Krettek, John McPherson


Applied Computing | Attractive Platform | Fault Tolerance | Performance Bottleneck | PVLDB 2011 |


Post Info

Added	17 Sep 2011
Updated	17 Sep 2011
Type	Journal
Year	2011
Where	PVLDB
Authors	Mohamed Y. Eltabakh, Yuanyuan Tian, Fatma Özcan, Rainer Gemulla, Aljoscha Krettek, John McPherson
