Software pipelining is a compile-time scheduling technique that overlaps successive loop iterations to expose instruction-level parallelism. A key issue in developing effective software pipelining algorithms is how to handle loops with conditional branches. Conditional branches increase the complexity and reduce the performance of software pipelining algorithms by introducing multiple possible execution paths into the scheduling problem.
To demonstrate the underlying idea, let us look for the most parallel execution feasible for a loop on an ILP-processor that has multiple execution units operating in parallel. Let us assume a RISC-like intermediate code for the loop body, such as −
load  r100, b(i);
fmul  r100, 2.0, r100;
store a(i), r100;
While demonstrating the principle of software pipelining, we focus only on the loop body and neglect the prologue and epilogue code.
We make the following assumptions about the ILP-processor. It has independent execution units for FP, FX, load, and store instructions, all capable of operating in parallel. Each execution unit can accept a new operation in every cycle. Finally, we assume that the FP unit delivers the result of the fmul instruction in, say, three cycles, whereas loads and stores have an execution latency of one cycle.
Now let us look for the most parallel execution feasible. We achieve it by unrolling the loop and executing subsequent iterations in as parallel a fashion as possible. Let us start with the first iteration. It can be executed as follows −
| Cycle | Operation | Comment |
| --- | --- | --- |
| c | load r101, b(1); | // b(1) is loaded |
| c+1 | fmul r101, 2.0, r101; | |
| c+2 | decr r200; | // decrement loop count |
| c+3 | nop | // wait for result of fmul |
| c+4 | store a(1)+, r101; | // store a(1), autoincrement i |
Here we note that the assumed latency of the fmul operation is three cycles; therefore, in cycle c+3 its result is not yet available and a nop has to be inserted. Under the assumptions made, the second iteration can be initiated in cycle c+1 by loading the second data item, b(2).
To avoid interference with the first iteration, that is, to avoid a WAW conflict on r101, r101 has to be renamed in the second iteration, say to r102. Then both iterations can be executed in parallel, which yields the following execution sequence −
| Cycle | Iteration 1 | Iteration 2 |
| --- | --- | --- |
| c | load r101, b(1); | |
| c+1 | fmul r101, 2.0, r101; | load r102, b(2); |
| c+2 | decr r200; | fmul r102, 2.0, r102; |
| c+3 | nop | decr r200; |
| c+4 | store a(1)+, r101; | nop |
| c+5 | | store a(2)+, r102; |
The next table shows the entire execution of the loop. Consider cycles c+4, c+5, and c+6: these are the ones displaying the real advantage of software pipelining. The important point is that in these cycles the available parallelism between subsequent loop iterations is fully exploited. For instance, in cycle c+4 the parallel operations are as follows −
- storing the result of iteration 1 (that is, a(1)) and auto-incrementing the index;
- decrementing the loop count, which is maintained in r200, by 1;
- performing the floating-point multiplication for iteration 4, that is 2.0 * b(4);
- loading the operand for iteration 5, that is b(5).
Most parallel execution of the given loop on an ILP-processor with multiple pipelined execution units