Large-scale cluster-based Internet services often host partitioned datasets to provide incremental scalability. The aggregation of results produced from multiple partitions is a f...
Advanced bit manipulation operations are not efficiently supported by commodity word-oriented microprocessors. Programming tricks are typically devised to shorten the long sequence...
The Sony-Toshiba-IBM Cell Broadband Engine is a heterogeneous multicore architecture that consists of a traditional microprocessor (PPE) with eight SIMD coprocessing units (SPEs) ...
This paper presents our experience developing applications in Jade, a portable, implicitly parallel programming language designed for exploiting task-level concurrency. Jade progr...
The Virtual Interface (VI) Architecture provides protected userlevel communication with high delivered bandwidth and low permessage latency, particularly for small messages. The V...