Understanding the effects and implications of compute node related failures in Hadoop



Hadoop has become a critical component in today’s cloud environment. Ensuring good performance for Hadoop is paramount for the wide range of applications built on top of it. In this paper we analyze Hadoop’s behavior under failures involving compute nodes. We find that even a single failure can result in inflated, variable and unpredictable job running times, all undesirable properties in a distributed system. We systematically track the causes underlying this distressing behavior. First, we find that Hadoop makes unrealistic assumptions about task progress rates. These assumptions can be easily invalidated by the cloud environment and, more surprisingly, by Hadoop’s own design decisions. The results are significant inefficiencies in Hadoop’s speculative execution algorithm. Second, failures are re-discovered individually by each task at the cost of great degradation in job running time. The reason is that Hadoop focuses on extreme scalability and thus trades off possible ...

Florin Dinu, T. S. Eugene Ng


Computer Systems Organization | Distributed And Parallel Computing | Execution Algorithm | Failure Semantics | HPDC 2012 |


Post Info

Added	29 Sep 2012
Updated	29 Sep 2012
Type	Journal
Year	2012
Where	HPDC
Authors	Florin Dinu, T. S. Eugene Ng
