Abstract. We present new performance models and a new, more compact data structure for cache blocking when applied to the sparse matrixvector multiply (SpM×V) operation, y ← y +...
Rajesh Nishtala, Richard W. Vuduc, James Demmel, K...
Traditionally, cache coherence in large-scale shared-memory multiprocessors has been ensured by means of a distributed directory structure stored in main memory. In this way, the ...
Software or hardware data cache prefetching is an efficient way to hide cache miss latency. However effectiveness of the issued prefetches have to be monitored in order to maximi...
Current superscalar processors, both RISC and CISC, require substantial instruction fetch and decode bandwidth to keep multiple functional units utilized. While CISC instructions ...
In this paper, we present an overview of extensible Retrieval, Annotation and Caching Engine (eRACE), a modular and distributed intermediary infrastructure that collects informati...