Sciweavers

148 search results - page 16 / 30
» Recovery From Software Failures Caused by Mandelbugs
ICDCS
2000
IEEE
15 years 2 months ago
Coherence-based Coordinated Checkpointing for Software Distributed Shared Memory Systems
Fault-tolerant techniques that can cope with system failures in software distributed shared memory (SDSM) are essential for creating productive and highly available parallel compu...
Angkul Kongmunvattana, Santipong Tanchatchawal, Ni...
ANSS
2007
IEEE
15 years 1 months ago
An Accurate and Efficient Time-Division Parallelization of Cycle Accurate Architectural Simulators
This paper proposes a parallel cycle-accurate microarchitectural simulator which efficiently executes its workload by splitting the simulation process along the time axis into many in...
Masahiro Yano, Toru Takasaki, Takashi Nakada, Hiro...
CBSE
2011
Springer
13 years 9 months ago
Rectifying orphan components using group-failover in distributed real-time and embedded systems
Orphan requests are a significant problem for multi-tier distributed systems since they adversely impact system correctness by violating the exactly-once semantics of application...
Sumant Tambe, Aniruddha S. Gokhale
LCPC
2007
Springer
15 years 3 months ago
Compiler-Enhanced Incremental Checkpointing
As modern supercomputing systems reach the peta-flop performance range, they grow in both size and complexity. This makes them increasingly vulnerable to failures from a variety o...
Greg Bronevetsky, Daniel Marques, Keshav Pingali, ...
MOBISYS
2007
ACM
15 years 9 months ago
NodeMD: diagnosing node-level faults in remote wireless sensor systems
Software failures in wireless sensor systems are notoriously difficult to debug. Resource constraints in wireless deployments substantially restrict visibility into the root cause...
Veljko Krunic, Eric Trumpler, Richard Han