This scheme employs an extra cache, known as the branch target address cache (BTAC), to speed up access to branch targets, as shown in the figure. The BTAC holds pairs of recently used branch addresses and their branch target addresses, and it is accessed associatively.
When the current instruction fetch address is a branch address and there is a matching entry in the BTAC, the branch target address is read along with the branch instruction in the same cycle. This BTA is then used to access the branch target instruction in the next cycle.
In other words, the BTAC holds branch target addresses (BTAs), and each BTA is read from the BTAC at the same time as the corresponding branch instruction is fetched.
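The lookup described above can be modelled in a few lines of Python. This is only a minimal sketch under stated assumptions: the class `BTAC`, the helper `next_fetch_address`, the example addresses, and the rule that every hit is treated as a taken branch are all illustrative and not taken from any particular processor.

```python
class BTAC:
    """Toy fully associative BTAC: branch address (BA) -> branch target address (BTA)."""

    def __init__(self):
        self.entries = {}

    def install(self, branch_addr, target_addr):
        # Record the target of a branch that has been seen before.
        self.entries[branch_addr] = target_addr

    def lookup(self, fetch_addr):
        # Probed with the fetch address in the same cycle as the instruction fetch.
        # Returns the BTA on a hit, or None on a miss.
        return self.entries.get(fetch_addr)


def next_fetch_address(btac, fetch_addr, instr_size=4):
    """Return the address to fetch in the next cycle."""
    bta = btac.lookup(fetch_addr)
    if bta is not None:
        return bta                    # hit: fetch the branch target next (assumed taken)
    return fetch_addr + instr_size    # miss: continue sequentially


btac = BTAC()
btac.install(0x1008, 0x2000)          # hypothetical branch at 0x1008 jumping to 0x2000

trace = []
pc = 0x1000
for _ in range(4):
    trace.append(pc)
    pc = next_fetch_address(btac, pc)
```

In this model the fetch stream is 0x1000, 0x1004, 0x1008 (the branch), 0x2000 (its target), so the target instruction follows the branch with no idle cycle in between.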
In this way, branch target instructions (BTIs) can be fetched immediately after the branch instructions, that is, without any idle cycles. Furthermore, the BTAC scheme even has the potential to implement zero-cycle branching. With zero-cycle branching, the first target instruction is fetched immediately after the last sequential instruction preceding the branch, without any delay.
For zero-cycle branching, the branch target address (BTA) must be accessed along with the instruction preceding the branch. The BTAC must therefore be indexed not by the branch address (BA) but by the fetch address of the instruction that precedes the branch. For a scalar processor with 4-byte instructions, this is the address BA - 4.
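The difference between indexing the BTAC by BA and by BA - 4 can be illustrated with a small fetch-trace model. This is a sketch only: the function names `btac_key` and `fetch_trace`, the example addresses, and the simplification that a hit always redirects the fetch are assumptions made for illustration.

```python
INSTR_SIZE = 4  # fixed 4-byte instructions, as in the scalar example in the text


def btac_key(branch_addr, zero_cycle):
    # Ordinary BTAC: entry keyed by the branch address (BA).
    # Zero-cycle branching: entry keyed by the preceding fetch address (BA - 4).
    return branch_addr - INSTR_SIZE if zero_cycle else branch_addr


def fetch_trace(start, branch_addr, target_addr, zero_cycle, cycles):
    """Return the sequence of fetch addresses over the given number of cycles."""
    btac = {btac_key(branch_addr, zero_cycle): target_addr}
    pc, trace = start, []
    for _ in range(cycles):
        trace.append(pc)
        # A BTAC hit redirects the next fetch; otherwise fetch sequentially.
        pc = btac.get(pc, pc + INSTR_SIZE)
    return trace


# Hypothetical branch at 0x1008 with target 0x2000.
plain = fetch_trace(0x1000, 0x1008, 0x2000, zero_cycle=False, cycles=4)
folded = fetch_trace(0x1000, 0x1008, 0x2000, zero_cycle=True, cycles=4)
```

With the ordinary key, the branch at 0x1008 still occupies a fetch slot before the target at 0x2000. With the BA - 4 key, the target is fetched directly after 0x1004, so the branch itself takes no slot in the fetch stream.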
The BTAC scheme was proposed by Lee and Smith (1984) and has also been called the branch target buffer design. This scheme is implemented in a number of processors, as shown in the table. The number of BTAC entries varies from 32 to 4K.
Examples of processors using the BTAC scheme
| Processor | Number of BTAC entries | Implementation of the BTAC |
|---|---|---|
| ES/9000 520-based processors (1992p) | 4K | 2-way associative |
| Pentium (1994) | 256 | Fully associative |
| MC 68060 (1993) | 256 | 4-way associative |
| PA 8000 (1995) | 32 | Fully associative |
| PowerPC 604 (1994) | 64 | Fully associative |
| PowerPC 620 (1995) | 256 | Fully associative |
There are some differences in the implementation of the BTAC scheme, especially concerning the following issues −
Whether the BTAC is implemented as a 2-way, 4-way, or fully associative cache.
How the BTAC is initialized.
Whether entries are retained in the BTAC for all recent branches or only for recently taken branches (in the latter case the BTAC scheme also performs implicit dynamic prediction).
How to select the entry to be overwritten, if there is no room in the BTAC for a new entry.
If the processor uses prediction bits, whether they are held in the BTAC or in a separate branch history table (BHT).
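Several of these design choices can be made concrete in a small model. The following is a sketch under explicit assumptions, not a description of any processor in the table: the class `SetAssociativeBTAC` is hypothetical, replacement is least-recently-used (one possible policy among many), and the "taken branches only" policy is modelled by dropping an entry as soon as its branch is resolved as not taken.

```python
from collections import OrderedDict


class SetAssociativeBTAC:
    """Toy set-associative BTAC with LRU replacement and taken-only allocation."""

    def __init__(self, num_sets=2, ways=2, instr_size=4):
        self.num_sets = num_sets
        self.ways = ways
        self.instr_size = instr_size
        # One ordered map per set; the oldest (least recently used) entry comes first.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def _set_for(self, branch_addr):
        # Assumed index function: low-order bits of the instruction-aligned address.
        return self.sets[(branch_addr // self.instr_size) % self.num_sets]

    def lookup(self, branch_addr):
        s = self._set_for(branch_addr)
        if branch_addr in s:
            s.move_to_end(branch_addr)   # mark as most recently used
            return s[branch_addr]
        return None

    def update(self, branch_addr, target_addr, taken):
        s = self._set_for(branch_addr)
        if not taken:
            # Taken-only policy (assumed): a not-taken branch loses its entry,
            # so a later miss acts as an implicit "not taken" prediction.
            s.pop(branch_addr, None)
            return
        if branch_addr not in s and len(s) >= self.ways:
            s.popitem(last=False)        # set is full: evict the LRU entry
        s[branch_addr] = target_addr
        s.move_to_end(branch_addr)


btac = SetAssociativeBTAC(num_sets=1, ways=2)
btac.update(0x100, 0x900, taken=True)
btac.update(0x104, 0x910, taken=True)
btac.lookup(0x100)                       # touch 0x100 so that 0x104 becomes LRU
btac.update(0x108, 0x920, taken=True)    # no room left: evicts 0x104
evicted = btac.lookup(0x104)
kept = btac.lookup(0x100)
btac.update(0x100, 0x900, taken=False)   # branch resolved as not taken
dropped = btac.lookup(0x100)
```

Setting `num_sets=1` makes the model fully associative, while `ways=2` or `ways=4` with more sets gives the 2-way and 4-way organizations. Keeping entries for not-taken branches as well would require storing prediction bits alongside each entry or in a separate BHT.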