In this section, we are concerned with an important performance measure of pipelined load/store processing: the load-use delay. The load-use delay is a characteristic attribute of the pipelined execution of loads. Large load-use delays can seriously impede processor performance, especially in a superscalar processor.
Load-use delays arise from load-use dependencies, a kind of RAW dependency. A load-use dependency gives rise to a load-use delay if the result of the load instruction cannot be made available by the pipeline in time for the subsequent instruction that consumes it.
A load-use delay can be handled either statically or dynamically. With static resolution, the compiler tries to insert as many independent instructions as necessary between the load instruction and the consumer instruction to compensate for the load delay.
MIPS computers, such as the R2000 and R3000, are examples of handling load-use delays by static scheduling. Here the compiler is expected to insert a load delay slot after each load instruction. This slot is filled by the compiler either with an independent instruction or with a NOP.
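The compiler's delay-slot-filling strategy can be sketched as a toy model. The instruction encoding, the `fill_delay_slots` function, and the dependence check below are illustrative assumptions, not any real compiler's algorithm; in particular, a real scheduler would also verify that hoisting an instruction past its neighbours preserves all dependencies.

```python
# Toy model of compile-time load delay slot filling (illustrative only;
# the instruction format and dependence checks are simplified assumptions).

def uses(instr, reg):
    """True if instr reads register reg."""
    return reg in instr.get("srcs", [])

def fill_delay_slots(program):
    """After every load, place an independent instruction or a NOP."""
    program = list(program)
    out = []
    i = 0
    while i < len(program):
        instr = program[i]
        out.append(instr)
        if instr["op"] == "load":
            dest = instr["dest"]
            # Look ahead for an instruction that does not use the loaded
            # value and hoist it into the delay slot. (A full scheduler
            # would also check dependencies against skipped instructions;
            # that check is omitted in this sketch.)
            filler = None
            for j in range(i + 1, len(program)):
                if not uses(program[j], dest):
                    filler = program.pop(j)
                    break
            out.append(filler if filler else {"op": "nop"})
        i += 1
    return out

prog = [
    {"op": "load", "dest": "r1", "srcs": ["r2"]},        # r1 <- MEM[r2]
    {"op": "add",  "dest": "r3", "srcs": ["r1", "r4"]},  # uses r1: hazard
    {"op": "sub",  "dest": "r5", "srcs": ["r6", "r7"]},  # independent
]
print([i["op"] for i in fill_delay_slots(prog)])  # ['load', 'sub', 'add']
```

Here the independent `sub` is hoisted into the delay slot; if no independent instruction were available, a NOP would be inserted instead.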
The other frequently used technique for handling load-use delays is dynamic scheduling. Here, dedicated hardware is responsible for detecting and resolving hazards that could harm sequential consistency. In general, the values of the load-use delays depend on the organization and implementation of the caches.
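The effect of such a hardware interlock can be illustrated with a minimal cycle-counting sketch. The one-cycle delay, the instruction format, and the `execute_cycles` function are assumptions for illustration; real interlock logic compares register specifiers between pipeline stages rather than walking a list.

```python
# Minimal sketch of hardware interlock behavior: the pipeline stalls the
# consumer of a load until the loaded value is available. The one-cycle
# delay and the instruction format are illustrative assumptions.

LOAD_USE_DELAY = 1  # cycles the consumer must wait after a load

def execute_cycles(program):
    """Count cycles on a scalar pipeline, charging a stall whenever an
    instruction reads the destination of the immediately preceding load
    (a load-use hazard)."""
    cycles = 0
    prev = None
    for instr in program:
        cycles += 1  # one issue cycle per instruction
        if (prev is not None and prev["op"] == "load"
                and prev["dest"] in instr.get("srcs", [])):
            cycles += LOAD_USE_DELAY  # hardware-inserted bubble(s)
        prev = instr
    return cycles

prog = [
    {"op": "load", "dest": "r1", "srcs": ["r2"]},
    {"op": "add",  "dest": "r3", "srcs": ["r1", "r4"]},  # stalls 1 cycle
    {"op": "sub",  "dest": "r5", "srcs": ["r3", "r6"]},
]
print(execute_cycles(prog))  # 4: three issue cycles plus one bubble
```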
Most current processors have a load-use delay of one cycle, and a few have two or three cycles, as shown in the table. There are several processors in which load-use delays are eliminated entirely, such as the Intel i486, the Pentium, the SuperSparc, and the R8000.
Values of load-use delays (in cycles) for current processors:

| Load-use delay (in cycles) | Processors |
|---|---|
| 0 | i486, Pentium, SuperSparc, R8000 |
| 1 | MIPS X, R6000, PA 7100, PA 7200, Power1 (RS/6000), Power2, PowerPC 601, 603 |
| 2 or 3 | MC 88100, α21064 (3 cycles), α21164 (2/3 cycles) |
For traditional scalar processors, load-use delays of one cycle are quite acceptable, since a parallel-optimizing ILP compiler will frequently find an independent instruction to fill the slot following a load.
However, for a superscalar processor with an instruction issue rate of 2 or higher, it is much less probable that the compiler can find, for each load instruction, two, three, four, or more independent instructions. Thus, with increasing instruction issue rates in superscalar processors, load-use delays become a bottleneck.
According to these results, an increase of the load-use delay from one to two or three cycles will reduce speed-up considerably. For instance, at an issue rate of 4, a load-use delay of 2 will impede performance by about 30% when compared with a load-use delay of 1. Although these figures are valid only for a certain set of parameters, a general tendency such as this can be expected.
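A back-of-the-envelope CPI model makes the trend concrete. The load frequency (25%) and the fraction of delay slots the compiler fails to fill (70%) below are assumed parameters chosen purely for illustration, not figures from the text, so the model reproduces only the qualitative tendency, not the exact 30% figure.

```python
# Back-of-the-envelope CPI model of the load-use penalty. The load
# frequency and unfilled-slot fraction are assumed parameters.

def relative_performance(load_use_delay, load_freq=0.25, unfilled=0.7,
                         base_cpi=1.0):
    """Performance relative to a machine with zero load-use delay:
    each load adds (unfilled fraction) * (delay) stall cycles."""
    cpi = base_cpi + load_freq * unfilled * load_use_delay
    return base_cpi / cpi

# Relative performance falls steadily as the load-use delay grows.
for d in (0, 1, 2, 3):
    print(d, round(relative_performance(d), 2))
```

Under these assumed parameters, going from a one-cycle to a two-cycle load-use delay costs roughly another ten percentage points of performance, consistent with the tendency described above.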