Optimization of Collective Reduction Operations

Five years of profiling in production mode at the University of Stuttgart have shown that more than 40% of the execution time of Message Passing Interface (MPI) routines is spent in the collective communication routines MPI_Allreduce and MPI_Reduce. Although MPI implementations have been available for about 10 years and all vendors are committed to the MPI standard, the vendors' and publicly available reduction algorithms could be accelerated by new algorithms by a factor between 3 (IBM, sum) and 100 (Cray T3E, maxloc) for long vectors. This paper presents five algorithms optimized for different choices of vector size and number of processes. The focus is on bandwidth-dominated protocols for power-of-two and non-power-of-two numbers of processes, optimizing the load balance in communication and computation.
Keywords: Message Passing, MPI, Collective Operations, Reduction.
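The abstract does not spell out the five algorithms. As a rough illustration of what a bandwidth-dominated allreduce for a power-of-two number of processes can look like, the sketch below simulates one well-known scheme in plain Python: a reduce-scatter by recursive halving followed by an allgather by recursive doubling. The function name allreduce_halving_doubling, the in-memory "processes", and the restriction to a sum over vectors whose length is divisible by the process count are assumptions made for this example, not details taken from the paper.

```python
from typing import List


def allreduce_halving_doubling(vectors: List[List[float]]) -> List[List[float]]:
    """Simulate a sum allreduce over p = len(vectors) ranks (hypothetical helper).

    Assumes p is a power of two and the vector length is divisible by p.
    """
    p = len(vectors)
    if p == 0 or p & (p - 1) != 0:
        raise ValueError("number of processes must be a power of two")
    n = len(vectors[0])
    if n % p != 0:
        raise ValueError("vector length must be divisible by the process count")

    bufs = [list(v) for v in vectors]
    lo = [0] * p  # start of the index range each rank is responsible for
    hi = [n] * p  # end (exclusive) of that range

    # Phase 1: reduce-scatter by recursive halving.
    # In each step a rank pairs with the rank at distance `dist`, keeps one
    # half of its current range, and adds the partner's data for that half.
    dist = p // 2
    while dist >= 1:
        new = [list(b) for b in bufs]
        for r in range(p):
            partner = r ^ dist
            mid = (lo[r] + hi[r]) // 2
            if r & dist == 0:
                keep_lo, keep_hi = lo[r], mid  # keep the lower half
            else:
                keep_lo, keep_hi = mid, hi[r]  # keep the upper half
            for i in range(keep_lo, keep_hi):
                new[r][i] = bufs[r][i] + bufs[partner][i]
            lo[r], hi[r] = keep_lo, keep_hi
        bufs = new
        dist //= 2

    # Phase 2: allgather by recursive doubling.
    # Pairs exchange the fully reduced ranges they own, doubling them each step.
    dist = 1
    while dist < p:
        new = [list(b) for b in bufs]
        new_lo, new_hi = list(lo), list(hi)
        for r in range(p):
            partner = r ^ dist
            for i in range(lo[partner], hi[partner]):
                new[r][i] = bufs[partner][i]
            new_lo[r] = min(lo[r], lo[partner])
            new_hi[r] = max(hi[r], hi[partner])
        bufs, lo, hi = new, new_lo, new_hi
        dist *= 2

    return bufs


if __name__ == "__main__":
    data = [[10 * r + i for i in range(8)] for r in range(4)]
    print(allreduce_halving_doubling(data)[0])
```

In this scheme each rank sends and receives progressively smaller pieces during the first phase and progressively larger ones during the second, so the total data moved per rank stays below twice the vector length regardless of the number of ranks. That is the sense in which such a protocol suits long vectors; short vectors are usually better served by latency-oriented schemes.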
Rolf Rabenseifner
Added 01 Jul 2010
Updated 01 Jul 2010
Type Conference
Year 2004
Where ICCS
Authors Rolf Rabenseifner