A long-term trend in high-performance computing is the increasing number of nodes in parallel computing platforms, which entails a higher failure probability. Fault tolerant progr...
It is an important problem to map virtual parallel processes to physical processors (or cores) in an optimized way to get scalable performance due to non-uniform communication cost...
Jin Zhang, Jidong Zhai, Wenguang Chen, Weimin Zhen...
This paper presents a high-level approach for assessing the performance behavior of complex scientific applications running on a high-performance system through simulation. The pr...
Thomas Fahringer, Nicola Mazzocca, Massimiliano Ra...
As computational clusters increase in size, their mean-time-to-failure reduces. Typically checkpointing is used to minimize the loss of computation. Most checkpointing techniques, ...
With the increasing number of processors in modern HPC(High Performance Computing) systems, there are two emergent problems to solve. One is scalability, the other is fault tolera...