We describe a methodology that enables the real-time diagnosis of performance problems in complex high-performance distributed systems. The methodology includes tools for generati...
Brian Tierney, William E. Johnston, Brian Crowley,...
Design templates that involve discovery, analysis, and integration of information resources commonly occur in many scientific research projects. In this paper we present examples o...
Joel H. Saltz, Scott Oster, Shannon Hastings, Step...
Timing failures refer to a situation where the environment in which a system operates does not behave as expected regarding the timing assumptions, that is, the timing constraints...
High-performance computing clusters running long-lived tasks currently cannot have kernel software updates applied to them without causing system downtime. These clusters miss oppo...
The paper addresses the problem of matching and scheduling a DAG-structured application to both minimize the makespan and maximize the robustness in a heterogeneous computing sys...