Sciweavers

TPDS
2010

Maximizing Service Reliability in Distributed Computing Systems with Random Node Failures: Theory and Implementation

In distributed computing systems (DCSs) where server nodes can fail permanently with nonzero probability, the system performance can be assessed by means of the service reliability, defined as the probability of serving all the tasks queued in the DCS before all the nodes fail. This paper presents a rigorous probabilistic framework to analytically characterize the service reliability of a DCS in the presence of communication uncertainties and stochastic topological changes due to node deletions. The framework considers a system composed of heterogeneous nodes with stochastic service and failure times and a communication network imposing random tangible delays. The framework also permits arbitrarily specified, distributed load-balancing actions to be taken by the individual nodes in order to improve the service reliability. The presented analysis is based upon a novel use of the concept of stochastic regeneration, which is exploited to derive a system of difference-differential equat...
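To make the service-reliability metric concrete, the following is a minimal sketch of a much simpler model than the one the abstract describes. It assumes, purely for illustration, that service and failure times are exponential, that nodes are independent, that no tasks are transferred between nodes (no load balancing), and that there are no communication delays. Under those assumptions the probability that every node finishes its own queue before failing has a closed form, which the simulation checks. The function names, rates, and queue sizes are hypothetical and do not come from the paper.

```python
import random


def analytic_reliability(queues, service_rates, failure_rates):
    """Closed form for the simplified model (assumption: exponential times,
    independent nodes, no task transfers): each node serves its m tasks
    before failing with probability (mu / (mu + lam)) ** m."""
    reliability = 1.0
    for m, mu, lam in zip(queues, service_rates, failure_rates):
        reliability *= (mu / (mu + lam)) ** m
    return reliability


def simulate_reliability(queues, service_rates, failure_rates,
                         trials=20000, seed=0):
    """Monte Carlo estimate of the same quantity."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(trials):
        all_served = True
        for m, mu, lam in zip(queues, service_rates, failure_rates):
            failure_time = rng.expovariate(lam)
            # Time to clear the queue is the sum of m exponential service times.
            finish_time = sum(rng.expovariate(mu) for _ in range(m))
            if finish_time > failure_time:
                all_served = False  # node failed with tasks still queued
                break
        successes += all_served
    return successes / trials


if __name__ == "__main__":
    # Hypothetical two-node system: queue sizes, service rates, failure rates.
    queues = [3, 2]
    service_rates = [2.0, 1.5]
    failure_rates = [0.1, 0.2]
    print("analytic :", analytic_reliability(queues, service_rates, failure_rates))
    print("simulated:", simulate_reliability(queues, service_rates, failure_rates))
```

The two printed values should agree up to sampling noise. Load balancing, random transfer delays, and non-exponential times, which the abstract's framework covers, are deliberately left out of this sketch.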
Jorge E. Pezoa, Sagar Dhakal, Majeed M. Hayat
Added 31 Jan 2011
Updated 31 Jan 2011
Type Journal
Year 2010
Where TPDS