An important requirement for the effective scheduling of parallel applications on large heterogeneous clusters is a current view of system resource availability. Maintaining such ...
Large-scale systems like BlueGene/L are susceptible to a number of software and hardware failures that can affect system performance. Periodic application checkpointing is a commo...
Real-time monitoring is increasingly becoming important in various scenes of large scale, multi-site distributed/parallel computing, e.g, understanding behavior of systems, schedu...
To observe, analyze and control large scale distributed systems and the applications hosted on them, there is an increasing need to continuously monitor performance attributes of ...
Shicong Meng, Srinivas R. Kashyap, Chitra Venkatra...