Management of program data to improve data locality and reduce false sharing is critical for scaling performanceon NUMA shared memorymultiprocessors. We use HPF-like data decomposi...
The least recently used (LRU) replacement policy performs poorly in the last-level cache (LLC) because temporal locality of memory accesses is filtered by first and second level...
Samira Manabi Khan, Zhe Wang, Daniel A. Jimé...
Parallel workstations, each comprising 10-100 processors, promise cost-effective general-purpose multiprocessing. This paper explores the coupling of such small- to medium-scale s...
provide tools or abstractions that allow developers to program in parallel. But what hardware do we need to support shared memory threads? The hardware should provide a well-defin...
As we move to large manycores, the hardware-based global checkpointing schemes that have been proposed for small shared-memory machines do not scale. Scalability barriers include ...