The growing disparity between processor and memory speeds has caused memory bandwidth to become the performance bottleneck for many applications. In particular, this performance ga...
In this paper we consider several hardware implementations of the general-purpose atomic primitives fetch and Φ, compare and swap, load linked, and store conditionalon large-scal...
We present a fast and scalable matrix multiplication algorithm on distributed memory concurrent computers, whose performance is independent of data distribution on processors, and...
— Skewed-associative caches use several hash functions to reduce collisions in caches without increasing the associativity. This technique can increase the hit ratio of a cache w...
Henk L. Muller, Paul W. A. Stallard, David H. D. W...
This paper describes VOL2, an interactive general-purpose volume renderer based on ray casting and implemented on Pixel-Planes 5, a distributed-memory, message-passing multicomput...
Andrei State, Jonathan McAllister, Ulrich Neumann,...