Although stencil auto-tuning has shown tremendous potential in effectively utilizing architectural resources, it has hitherto been limited to single kernel instantiations; in addi...
Shoaib Kamil, Cy Chan, Leonid Oliker, John Shalf, ...
Abstract. When parallelizing loop nests for distributed memory parallel computers, we have to specify when the different computations are carried out (computation scheduling), wher...
Alain Darte, Claude G. Diderich, Marc Gengler, Fr...
We present a new fast and scalable matrix multiplication algorithm, called DIMMA (Distribution-Independent Matrix Multiplication Algorithm), for block cyclic data distribution on ...
Several fast sequential algorithms have been proposed in the past to multiply sparse matrices. These algorithms do not explicitly address the impact of caching on performance. We show th...