An asynchronous work-stealing implementation of dynamic load balance is implemented using Unified Parallel C (UPC) and evaluated using the Unbalanced Tree Search (UTS) benchmark ...
Effective overlap of computation and communication is a well understood technique for latency hiding and can yield significant performance gains for applications on high-end compu...
Aniruddha G. Shet, P. Sadayappan, David E. Bernhol...
There is an increasing trend to use commodity microprocessors as the compute engines in large-scale multiprocessors. However, given that the majority of the microprocessors are so...
Traditional list schedulers order instructions based on an optimistic estimate of the load latency imposed by the hardware and therefore cannot respond to variations in memory lat...
This paper focuses on non-strict processing, optimization, and partial evaluation of MPI programs which use incremental data structures (ISs). We describe the design and implement...