Sciweavers

PPOPP
2015
ACM

Optimization for performance and energy for batched matrix computations on GPUs

As modern hardware keeps evolving, an increasingly effective approach to developing energy-efficient, high-performance solvers is to design them to work on many small, independent problems. Many applications already need this functionality, especially on GPUs, which are currently known to be about four to five times more energy efficient than multicore CPUs. We describe the development of the main one-sided factorizations that work on a set of small dense matrices in parallel, and we illustrate our techniques on the LU and Cholesky factorizations. We refer to this mode of operation as a batched factorization. Our approach is based on representing the algorithms as a sequence of batched BLAS routines for GPU-only execution. The goal of avoiding multicore CPU use, e.g., as in the hybrid CPU-GPU algorithms, is to benefit exclusively from the GPU’s significantly higher energy efficiency, as well as from the removal of the costly CPU-to-GPU communications. Furthermore, we do no...
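To make the idea of a batched factorization concrete, here is a minimal CPU-side sketch in plain Python. It is not the authors' GPU implementation: the function name `batched_cholesky` and the list-of-lists matrix layout are assumptions chosen for illustration. It shows only the structure the abstract describes, where each step of the Cholesky factorization is applied across the whole batch of small matrices before the next step begins.

```python
import math


def batched_cholesky(batch):
    """Factor each symmetric positive-definite matrix in `batch` as L * L^T.

    Hypothetical helper for illustration. Returns one lower-triangular
    factor per input matrix. All matrices are assumed to have the same size.
    """
    n = len(batch[0])
    # Work on copies so the inputs are left untouched.
    work = [[row[:] for row in a] for a in batch]
    for k in range(n):
        # Batched step 1: square root of the k-th pivot in every matrix.
        for a in work:
            a[k][k] = math.sqrt(a[k][k])
        # Batched step 2: scale the k-th column below the pivot.
        for a in work:
            for i in range(k + 1, n):
                a[i][k] /= a[k][k]
        # Batched step 3: rank-1 update of the trailing lower triangle.
        for a in work:
            for j in range(k + 1, n):
                for i in range(j, n):
                    a[i][j] -= a[i][k] * a[j][k]
    # Zero the strictly upper triangle so each result is a clean L.
    for a in work:
        for i in range(n):
            for j in range(i + 1, n):
                a[i][j] = 0.0
    return work


if __name__ == "__main__":
    # Two small independent problems factored together.
    factors = batched_cholesky([[[4, 2], [2, 3]], [[9, 3], [3, 5]]])
    for lower in factors:
        print(lower)
```

In this sketch, each `for a in work` loop stands in for what a single batched routine would do across all matrices in one call. On a GPU those loops would presumably run in parallel, but that mapping is an assumption here, not a detail taken from the abstract.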
Added 16 Apr 2016
Updated 16 Apr 2016
Type Journal
Year 2015
Where PPOPP
Authors Azzam Haidar, Tingxing Dong, Piotr Luszczek, Stanimire Tomov, Jack J. Dongarra