Sciweavers

CCGRID
2008
IEEE

Using Probabilistic Characterization to Reduce Runtime Faults in HPC Systems

Abstract: The current trend in high performance computing is to aggregate ever larger numbers of processing and interconnection elements in order to achieve desired levels of computational power. This, however, also comes with a decrease in the Mean Time To Interrupt because the elements comprising these systems are not becoming significantly more robust. There is substantial evidence that the relationship between Mean Time To Interrupt and the number of processor elements involved is quite similar over a large number of platforms. In this paper we present a system that uses hardware-level monitoring coupled with statistical analysis and modeling to select processing system elements based on where they lie in the statistical distribution of similar elements. These characterizations can be used by the scheduler/resource manager to deliver a close-to-optimal set of processing elements given the available pool and the reliability requirements of the application.
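The selection idea in the abstract can be illustrated with a minimal sketch. This is not the authors' system. It assumes a single hypothetical scalar health metric per node (for example an average temperature), uses a plain z-score as the measure of where a node lies in the distribution, and assumes that nodes closer to the population mean are preferable. The function name `select_nodes`, its parameters, and the threshold `max_abs_z` are invented for illustration.

```python
from statistics import mean, stdev


def select_nodes(readings, count, max_abs_z=None):
    """Pick the `count` nodes whose metric is closest to the population mean.

    readings  : dict mapping node name -> one scalar metric (hypothetical,
                e.g. an average temperature); needs at least two nodes.
    count     : number of nodes the application requests.
    max_abs_z : optional reliability requirement (assumed form): reject any
                node whose absolute z-score exceeds this value.
    """
    values = list(readings.values())
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation

    def abs_z(node):
        # With zero spread every node is equally typical.
        return 0.0 if sigma == 0 else abs(readings[node] - mu) / sigma

    # Most typical nodes first; the name breaks ties deterministically.
    ranked = sorted(readings, key=lambda node: (abs_z(node), node))

    if max_abs_z is not None:
        ranked = [node for node in ranked if abs_z(node) <= max_abs_z]

    if len(ranked) < count:
        raise ValueError("not enough nodes satisfy the reliability requirement")
    return ranked[:count]


if __name__ == "__main__":
    pool = {"n0": 50.0, "n1": 51.0, "n2": 49.0, "n3": 70.0}
    print(select_nodes(pool, 2))
```

A scheduler could call such a function with a tighter `max_abs_z` for jobs that need higher reliability, and with no threshold for jobs that tolerate interruption. A real system would likely combine several monitored quantities rather than one.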
Added 12 Oct 2010
Updated 12 Oct 2010
Type Conference
Year 2008
Where CCGRID
Authors Jim M. Brandt, Bert J. Debusschere, Ann C. Gentile, Jackson Mayo, Philippe P. Pébay, David Thompson, Matthew Wong