In this paper, we consider a generic model of computational grids, seen as several clusters of homogeneous processors. In such systems, a key issue when designing efficient job al...
The research community has shown strong interest in monitoring large-scale distributed systems. In these applications, we typically wish to monitor a global system condition wh...
Ali Abbasi, Ahmad Khonsari, Mohammad Sadegh Talebi
We present Zyzzyva, a protocol that uses speculation to reduce the cost and simplify the design of Byzantine fault tolerant state machine replication. In Zyzzyva, replicas respond...
Ramakrishna Kotla, Lorenzo Alvisi, Michael Dahlin,...
Recent technological advances have opened up a wide range of distributed real-time applications involving battery-driven embedded devices with local processing and wireless communi...
G. Sudha Anil Kumar, Govindarasu Manimaran, Zhengd...
Clusters and applications continue to grow in size while their mean time between failures (MTBF) is getting smaller. Checkpoint/Restart is becoming increasingly important for lar...