Fault-Tolerant Distributed Simulation

In traditional distributed simulation schemes, the entire simulation needs to be restarted if any of the participating logical processes (LPs) crashes. This is highly undesirable for long-running simulations. Some form of fault tolerance is required to minimize the wasted computation. In this paper, a rollback-based optimistic fault-tolerance scheme is integrated with an optimistic distributed simulation scheme. In rollback recovery schemes, checkpoints are periodically saved on stable storage. After a crash, these saved checkpoints are used to restart the computation. We make use of the novel insight that a failure can be modeled as a straggler event with the receive time equal to the virtual time of the last checkpoint saved on stable storage. This results in savings in implementation effort, as well as reduced overheads. We define stable global virtual time (SGVT) as the virtual time such that no state with a lower timestamp will ever be rolled back despite crash failures. A simple change is made in existing...
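The central idea of the abstract, that a crash can be handled by the same rollback routine as a straggler event, can be illustrated with a small sketch. This is not the authors' implementation: the class `ToyLP`, its methods, and the choice of a running sum as the simulation state are all hypothetical, and the sketch assumes distinct event timestamps, a single LP, and no anti-messages or re-delivery of lost events.

```python
from dataclasses import dataclass, field


@dataclass
class ToyLP:
    """Hypothetical toy optimistic LP; its state is a running sum of event values."""
    state: int = 0
    lvt: int = 0  # local virtual time
    # Each snapshot is (virtual_time, state, on_stable_storage).
    snapshots: list = field(default_factory=lambda: [(0, 0, True)])
    log: list = field(default_factory=list)  # executed (time, value) events

    def _apply(self, time, value, stable):
        # Execute one event and snapshot the resulting state.
        self.state += value
        self.lvt = time
        self.log.append((time, value))
        self.snapshots.append((time, self.state, stable))

    def rollback(self, receive_time):
        """Restore the latest snapshot at or before receive_time; return undone events."""
        self.snapshots = [s for s in self.snapshots if s[0] <= receive_time]
        self.lvt, self.state, _ = self.snapshots[-1]
        undone = [e for e in self.log if e[0] > self.lvt]
        self.log = [e for e in self.log if e[0] <= self.lvt]
        return undone

    def receive(self, time, value, stable=False):
        """Execute an event; a timestamp below lvt makes it a straggler."""
        if time < self.lvt:
            undone = self.rollback(time)
            # Re-execute the undone events together with the straggler, in time order.
            for t, v in sorted(undone + [(time, value)]):
                self._apply(t, v, False)
        else:
            self._apply(time, value, stable)

    def crash(self):
        """Lose volatile data, then recover through the ordinary rollback path."""
        self.snapshots = [s for s in self.snapshots if s[2]]
        last_stable_time = self.snapshots[-1][0]
        self.log = [e for e in self.log if e[0] <= last_stable_time]
        # The failure acts as a straggler whose receive time is the
        # virtual time of the last checkpoint on stable storage.
        self.rollback(last_stable_time)
```

For instance, after `receive(10, 1)`, `receive(20, 2, stable=True)` and `receive(30, 4)`, calling `crash()` leaves the LP at virtual time 20 with state 3. The only crash-specific code is the loss of volatile data; the state restoration itself reuses `rollback`, which is the saving in implementation effort the abstract points to.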
Om P. Damani, Vijay K. Garg
Added 05 Aug 2010
Updated 05 Aug 2010
Type Conference
Year 1998
Where PADS