This paper describes a proposal for a set of Parallel Basic Linear Algebra Subprograms PBLAS. The PBLAS are targeted at distributed vector-vector, matrix-vector and matrixmatrix...
Jaeyoung Choi, Jack Dongarra, Susan Ostrouchov, An...
QR decomposition is a computationally intensive linear algebra operation that factors a matrix A into the product of a unitary matrix Q and upper triangular matrix R. Adaptive sys...
The development of high performance dense linear algebra (DLA) critically depends on highly optimized BLAS, and especially on the matrix multiplication routine (GEMM). This is espe...
Tiling has long been used to improve cache performance. Recursion has recently been used as a cache-oblivious method of improving cache performance. Both of these techniques are n...
Joon-Sang Park, Michael Penner, Viktor K. Prasanna
On modern computers, the performance of programs is often limited by memory latency rather than by processor cycle time. To reduce the impact of memory latency, the restructuring ...
Induprakas Kodukula, Keshav Pingali, Robert Cox, D...