Most of today's multiprocessors have a Distributed Shared Memory (DSM) organization, which enables scalability while retaining the convenience of the shared-memory programming...
Angeles G. Navarro, Rafael Asenjo, Emilio L. Zapat...
Classic loop unrolling increases the performance of sequential loops by reducing the overheads of the non-computational parts of the loop. Unfortunately, when the loop con...
Roger Ferrer, Alejandro Duran, Xavier Martorell, E...
This paper presents a compile-time scheme for partitioning non-rectangular loop nests which consist of inner loops whose bounds depend on the index of the outermost, parallel loop...
This paper addresses the scheduling of uniform-dependence loop nests within the framework of the bulk-synchronous parallel (BSP) model. Two broad classes of tightly-nested loops are...
Data locality is critical to achieving high performance on large-scale parallel machines. Non-local data accesses result in communication that can greatly impact performance. Thus ...