Parallel dataflow programs generate enormous amounts of distributed data that are short-lived, yet are critical for completion of the job and for good run-time performance. We ca...
Steven Y. Ko, Imranul Hoque, Brian Cho, Indranil G...
In this paper, we describe an algorithm and implementation of locality optimizations for architectures with instruction sets such as Intel’s SSE and Motorola’s AltiVec that su...
As on-chip integration matures, single-chip system designers must not only be concerned with component-level issues such as performance and power, but also with onchip system-leve...
Control independence has been put forward as a significant new source of instruction-level parallelism for future generation processors. However, its performance potential under p...
Abstract. Conventional performance environments are based on pro ling and event instrumentation. It becomes problematic as parallel systems scale to hundreds of nodes and beyond. A...
Xian-He Sun, Mario Pantano, Thomas Fahringer, Zhao...