Given the scale of massively parallel systems, occurrence of faults is no longer an exception but a regular event. Periodic checkpointing is becoming increasingly important in the...
Long running High Performance Computing (HPC) applications at scale must be able to tolerate inevitable faults if they are to harness current and future HPC systems. Message Passi...
Abstract—Distributed interactive applications may have stringent latency requirements and dynamic user groups. These applications may benefit from a group communication system, ...
The advent of large scale multi-hop wireless networks highlights problems of fault tolerance and scale in distributed system, motivating designs that autonomously recover from tra...
The general problem of verifying coherence for shared-memory multiprocessor executions is NP-Complete. Verifying memory consistency models is therefore NP-Hard, because memory con...