Journal of Parallel and Distributed Computing, in press.
In this paper we propose and evaluate a new data-prefetching technique
for cache coherent multiprocessors. Prefetches are issued by a functional
unit called a prefetch engine which is controlled by the compiler.
We let second-level cache misses generate cache miss traps, and start the
prefetch engine in a trap handler. The trap handler is fast (40-50 cycles)
and does not normally delay the program beyond the memory latency of the
miss. Once started, the prefetch engine executes on its own and causes
no instruction overhead. The only instruction overhead in our approach
is incurred when a trap handler completes after the data has arrived.
The advantages of this technique are (1) it exploits static compiler
analysis to determine what to prefetch, which is hard to do in hardware;
(2) it incurs very little instruction overhead, which is a limiting
factor for traditional software-controlled prefetching; and (3) it is
accurate in the sense that it generates very little useless traffic
while maintaining high prefetching coverage. We also study whether one
could emulate the prefetch engine in
software, which would not require any additional hardware beyond support
for generating cache miss traps and ordinary prefetch instructions.
In this paper we present the functionality of the prefetch engine and a compiler algorithm to control it. We evaluate our technique on six parallel scientific and engineering applications, using an optimising compiler with our algorithm and a simulated multiprocessor. We find that the prefetch engine removes up to 67% of the memory access stall time at an instruction overhead of less than 0.42%.