The number of processors embedded in high-performance computing platforms is continuously increasing to accommodate users' desire to solve larger and more complex problems. However,...
Thara Angskun, George Bosilca, Graham E. Fagg, Jel...
The term “self-healing” denotes the capability of a software system to deal with bugs. Fault tolerance for dependable computing is to provide the specified service through ...
Scalable and fault tolerant runtime environments are needed to support and adapt to the underlying libraries and hardware which require a high degree of scalability in dynamic larg...
Thara Angskun, Graham E. Fagg, George Bosilca, Jel...
The number of processors embedded in high performance computing platforms is growing daily to solve larger and more complex problems. The logical network topologies must also suppo...
This paper explores the challenges associated with distributed application management in large-scale computing environments. In particular, we investigate several techniques for e...
Nikolay Topilski, Jeannie R. Albrecht, Amin Vahdat