Abstract—We present reliability solutions for adaptable Network RAM systems running on general-purpose clusters. Network RAM allows nodes with over-committed memory to swap pages...
Abstract—In this paper, a tool named CheCUDA is designed to checkpoint CUDA applications that use GPUs as accelerators. As existing checkpoint/restart implementations do not supp...
For decades, the serialization constraints imposed by true data dependences have been regarded as an absolute limit--the dataflow limit--on the parallel execution of serial progra...
This paper describes PRISM, a distributed sharedmemory architecture that relies on a tightly integrated hardware and operating system design for scalable and reliable performance....
Replication is a technique used in Data Grid environments that helps to reduce access latency and network bandwidth utilization. Replication also increases data availability thereb...