We present an auto-tuning approach to optimize application performance on emerging multicore architectures. The methodology extends the idea of searchbased performance optimizatio...
Samuel Williams, Jonathan Carter, Leonid Oliker, J...
The formal analysis of parallelism and pipelining is performed on an 8-bit Add-Compare-Select element of a Viterbi decoder. The results are quantified through a study of the delay...
The embedded-block coding with optimized truncation (EBCOT), which consists of a bit-plane coder (BPC) and a binary arithmetic coder (BAC), is the bottleneck in realizing a high-p...
As we move to large manycores, the hardware-based global checkpointing schemes that have been proposed for small shared-memory machines do not scale. Scalability barriers include ...
Abstract. Limited bandwidth to off-chip main memory is a performance bottleneck in chip multiprocessors for streaming computations, such as Cell/B.E., and this will become even mor...