Given the scale of massively parallel systems, occurrence of faults is no longer an exception but a regular event. Periodic checkpointing is becoming increasingly important in the...
We consider problems where multiple agents cooperate to control their individual state so as to optimize a common objective while communicating with each other to exchan...
Recently there has been renewed interest in building reliable servers that support continuous application operation. Besides keeping system state consistent after a failure, o...
Teams that are geographically distributed often share information both in real-time and asynchronously. When such sharing is through groupware, change conflicts can arise when peo...
Mark S. Hancock, John David Miller, Saul Greenberg...
In this paper, we describe two mission critical applications currently deployed by Telecom Italia in the Operations Support System domains. The first one called "Network Neut...