How to remove Load-use delay in Computer Architecture?

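The load-use delay is the number of cycles a dependent instruction must wait after a load before it can use the loaded value. The layout of a processor pipeline affects this delay. The figure shows the traditional RISC, MIPS, and CISC pipeline layouts and the associated load-use delays.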

In a traditional four-stage RISC pipeline, the registers that supply the components of the address calculation, such as the contents of a specified base or index register, are read in the D stage. In the E stage, the effective (virtual) address is then calculated using the fixed-point (FX) adder. At the end of this cycle, the virtual address can be sent to the MMU and/or the cache. Assuming a high-performance cache, the data is available at the end of the next pipeline cycle, resulting in a load-use delay of one cycle.

In the case of the traditional MIPS pipeline, the virtual address is again sent out at the end of the E stage. Assuming once more a single-cycle cache latency, the requested data arrives from the cache at the end of the C cycle. Thus, a traditional MIPS pipeline also has a load-use delay of one cycle.
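To make the cost of a one-cycle load-use delay concrete, the following is a minimal sketch that counts the stall cycles (bubbles) an in-order pipeline would insert for a short instruction sequence. The `Instr` record, the register names, and the function `count_load_use_stalls` are hypothetical and introduced only for illustration. The sketch assumes one instruction issues per cycle and that results of non-load instructions are forwarded without any delay.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    """Hypothetical instruction record used only in this sketch."""
    op: str                 # "load" or any other mnemonic
    dest: str               # destination register
    srcs: Tuple[str, ...]   # source registers


def count_load_use_stalls(program, load_use_delay=1):
    """Count the bubbles needed so that no instruction reads a loaded
    register before the load data is available.

    Assumption: only loads delay their consumers; all other results
    are forwarded in time for the next instruction.
    """
    stalls = 0
    issue_cycle = 0
    ready_at = {}  # register -> earliest cycle a consumer may issue

    for ins in program:
        # A consumer cannot issue before all of its loaded sources are ready.
        earliest = max([issue_cycle] + [ready_at.get(r, 0) for r in ins.srcs])
        stalls += earliest - issue_cycle
        issue_cycle = earliest

        if ins.op == "load":
            # The loaded value is usable load_use_delay cycles later than
            # the result of an ordinary instruction would be.
            ready_at[ins.dest] = issue_cycle + 1 + load_use_delay
        else:
            # A non-load overwrite makes any pending load to dest irrelevant.
            ready_at.pop(ins.dest, None)

        issue_cycle += 1

    return stalls


# The dependent add directly follows the load.
dependent = [
    Instr("load", "r1", ("r2",)),
    Instr("add", "r3", ("r1", "r4")),
]

# An independent instruction is scheduled between the load and its use.
scheduled = [
    Instr("load", "r1", ("r2",)),
    Instr("sub", "r5", ("r6", "r7")),
    Instr("add", "r3", ("r1", "r4")),
]

print(count_load_use_stalls(dependent))   # one bubble
print(count_load_use_stalls(scheduled))   # no bubble
```

In this model, placing an independent instruction between the load and its consumer hides the one-cycle delay, which is the usual reason compilers try to fill the slot after a load.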

A traditional CISC pipeline, on the other hand, is designed to process register-memory instructions. As a consequence, it is laid out so that the referenced memory data can already be used in the E stage of the same instruction, as shown in the figure.

Thus this layout does not cause a load-use delay at all. However, because of the larger number of pipeline stages, more instructions are in flight at the same time, so more dependencies between instructions can be expected than in four- or five-stage pipelines. This can adversely affect performance.

So far we have assumed a high-performance cache that delivers data in one cycle, including address translation, and that every access hits in the cache. For slower caches, the load-use delay is longer unless special measures are taken. One technique for reducing the load-use delay with slower caches is to fit the cache into the pipeline layout by moving the address calculation earlier, by either half a pipeline cycle or an entire pipeline cycle, as shown in the figure.
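The effect of moving the address calculation earlier can be illustrated with a simplified timing model. The function `load_use_delay` and its parameters are assumptions made for this sketch, and the cache latencies used are illustrative values, not measured figures for any particular processor. Time is counted in cycles from the start of the load's E stage, and the dependent instruction that directly follows the load is assumed to need the data at the start of its own E stage, one cycle later.

```python
import math


def load_use_delay(addr_ready, cache_latency):
    """Estimate the load-use delay in whole cycles (simplified model).

    addr_ready:    when the effective address is available, in cycles from
                   the start of the load's E stage
                   (1.0 = end of E, 0.5 = middle of E, 0.0 = end of D).
    cache_latency: cycles from presenting the address to receiving the
                   data, assuming a cache hit.
    """
    data_ready = addr_ready + cache_latency
    # The next instruction would start its E stage at time 1.0; it has to
    # wait for the first cycle boundary at or after data_ready.
    consumer_needs_data_at = 1.0
    return max(0, math.ceil(data_ready - consumer_needs_data_at))


# Address at the end of E, single-cycle cache: the baseline case.
print(load_use_delay(addr_ready=1.0, cache_latency=1.0))

# A slower cache (1.5 cycles) without any adjustment to the layout.
print(load_use_delay(addr_ready=1.0, cache_latency=1.5))

# Same slow cache, address calculation moved half a cycle earlier.
print(load_use_delay(addr_ready=0.5, cache_latency=1.5))

# Two-cycle cache, address calculation moved a full cycle earlier.
print(load_use_delay(addr_ready=0.0, cache_latency=2.0))
```

Under these assumptions, a cache that is half a cycle or a full cycle slower costs an extra delay cycle only if the address calculation stays at the end of the E stage; moving the calculation earlier by the same amount brings the delay back to one cycle.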

For instance, in the R2000 and R3000 processors, the address calculation takes place in the first half of the E cycle. The same holds for the high-performance HP 7100. This processor is unique in that it uses an off-chip cache, which explains the need to move the address calculation subtask earlier.

Several processors, such as the Am 29000 and the R6000, shift the address calculation even further, into the last phase of the decode (D) stage.