The ability to checkpoint a running application and restart it later can provide many useful benefits including fault recovery, advanced resources sharing, dynamic load balancing...
Recovery systems must save state before a failure occurs to enable the system to recover from the failure. However, recovery will fail if the recovery system saves any state corru...
distributed shared-memory (SDSM) provides the abstraction necessary to run shared-memory applications on cost-effective parallel platforms such as clusters of workstations. Howeve...
We introduce Membrane, a set of changes to the operating system to support restartable file systems. Membrane allows an operating system to tolerate a broad class of file system f...
We present SGuard, a new fault-tolerance technique for distributed stream processing engines (SPEs) running in clusters of commodity servers. SGuard is less disruptive to normal s...
YongChul Kwon, Magdalena Balazinska, Albert G. Gre...