Sciweavers

PPOPP
2010
ACM
14 years 1 month ago
Data transformations enabling loop vectorization on multithreaded data parallel architectures
Loop vectorization, a key feature exploited to obtain high performance on Single Instruction Multiple Data (SIMD) vector architectures, is significantly hindered by irregular memo...
Byunghyun Jang, Perhaad Mistry, Dana Schaa, Rodrig...
PPOPP
2010
ACM
14 years 1 month ago
Symbolic prefetching in transactional distributed shared memory
We present a static analysis for the automatic generation of symbolic prefetches in a transactional distributed shared memory. A symbolic prefetch specifies the first object to be...
Alokika Dash, Brian Demsky
PPOPP
2010
ACM
14 years 1 month ago
A distributed placement service for graph-structured and tree-structured data
Effective data placement strategies can enhance the performance of data-intensive applications implemented on high end computing clusters. Such strategies can have a significant i...
Gregory Buehrer, Srinivasan Parthasarathy, Shirish...
PPOPP
2010
ACM
14 years 1 month ago
Using data structure knowledge for efficient lock generation and strong atomicity
To achieve high performance on multicore systems, shared-memory parallel languages must efficiently implement atomic operations. The commonly used and studied paradigms for atomici...
Gautam Upadhyaya, Samuel P. Midkiff, Vijay S. Pai
PPOPP
2010
ACM
14 years 1 month ago
Continuous speculative program parallelization in software
This paper addresses the problem of extracting coarse-grained parallelism from large sequential code. It builds on BOP, a system for software speculative parallelization. BOP lets...
Chao Zhang, Chen Ding, Xiaoming Gu, Kirk Kelsey, T...
PPOPP
2010
ACM
14 years 1 month ago
Fast tridiagonal solvers on the GPU
We study the performance of three parallel algorithms and their hybrid variants for solving tridiagonal linear systems on a GPU: cyclic reduction (CR), parallel cyclic reduction (...
Yao Zhang, Jonathan Cohen, John D. Owens
PPOPP
2010
ACM
14 years 1 month ago
A practical concurrent binary search tree
We propose a concurrent relaxed balance AVL tree algorithm that is fast, scales well, and tolerates contention. It is based on optimistic techniques adapted from software transact...
Nathan Grasso Bronson, Jared Casper, Hassan Chafi,...
PPOPP
2010
ACM
14 years 1 month ago
The LOFAR correlator: implementation and performance analysis
LOFAR is the first of a new generation of radio telescopes. Rather than using expensive dishes, it forms a distributed sensor network that combines the signals from many thousands...
John W. Romein, P. Chris Broekema, Jan David Mol, ...
PPOPP
2010
ACM
14 years 1 month ago
Featherweight X10: a core calculus for async-finish parallelism
We present a core calculus with two of X10's key constructs for parallelism, namely async and finish. Our calculus forms a convenient basis for type systems and static analys...
Jonathan K. Lee, Jens Palsberg
PPOPP
2010
ACM
14 years 1 month ago
Does cache sharing on modern CMP matter to the performance of contemporary multithreaded programs?
Most modern Chip Multiprocessors (CMP) feature shared cache on chip. For multithreaded applications, the sharing reduces communication latency among co-running threads, but also r...
Eddy Z. Zhang, Xipeng Shen, Yunlian Jiang