Loops are an important source of parallelism for ILP-processors: the regularity of their control structure can be exploited to speed up computation. Loop scheduling is therefore a central concern of instruction schedulers developed for highly parallel ILP-processors, including VLIWs.
There are two different types of loop scheduling, as follows −
The basic concept of loop unrolling is to repeat the loop body multiple times and to discard the now-unnecessary inter-iteration code, such as decrementing the loop count, testing for loop end, and conditionally branching back between iterations.
This results in a shorter execution time. Loop unrolling can be performed straightforwardly when the number of iterations is already known at compile time, which is usually the case for ‘do’ and ‘for’ loops.
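As a minimal C sketch of this idea (the function names and the scaling computation are illustrative, not from the original), compare a conventional loop with a fully unrolled version for a trip count of 4 that is known at compile time:

```c
/* Rolled version: the loop counter update, end test, and backward
   branch execute on every iteration. */
void scale_rolled(double *a, const double *b, int n) {
    for (int i = 0; i < n; i++)
        a[i] = 2.0 * b[i];
}

/* Fully unrolled for a trip count of 4 known at compile time:
   no counter update, no end test, no branch between iterations. */
void scale_unrolled4(double *a, const double *b) {
    a[0] = 2.0 * b[0];
    a[1] = 2.0 * b[1];
    a[2] = 2.0 * b[2];
    a[3] = 2.0 * b[3];
}
```

Both compute the same results; the unrolled version trades code size for the elimination of all per-iteration loop overhead.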
Loop unrolling saves execution time, at the cost of code length, in much the same way as code inlining or traditional macro expansion. Code inlining is one of the standard compiler optimization techniques, used for short, frequently called subroutines.
Code inlining means substituting the entire subroutine body at each point of call, rather than storing the subroutine body separately and calling it from the main code when required.
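A small C sketch of the difference (the `square` helper and both function names are hypothetical examples, not from the original):

```c
/* Out-of-line: each call site branches to the subroutine and back. */
static int square(int x) { return x * x; }

int sum_of_squares_call(int a, int b) {
    return square(a) + square(b);
}

/* Inlined: the subroutine body is substituted at each point of call,
   so no call/return overhead remains. Compilers do this automatically
   at higher optimization levels, or when hinted with `inline`. */
int sum_of_squares_inlined(int a, int b) {
    return (a * a) + (b * b);
}
```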
Simple unrolling is not applicable when a loop has to be executed a very large number of times, or when the number of iterations is not known at compile time. In such cases, simple unrolling has to be extended. The usual method is to unroll the loop a given number of times, say three times, and to set up a loop over the resulting groups of unrolled iterations. The decrementing, testing for loop end, and conditional branching back are then needed only once per group of unrolled iterations.
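A C sketch of this extended scheme, assuming an unrolling factor of 3 and reusing the illustrative array-scaling computation (the function name is hypothetical):

```c
/* Unrolled by a factor of 3: the body is repeated three times per trip,
   so the counter update, end test, and backward branch execute once per
   group of three iterations instead of once per iteration. A residual
   loop handles the leftover n % 3 elements when the trip count is not
   known at compile time or is not a multiple of 3. */
void scale_unrolled_by3(double *a, const double *b, int n) {
    int i = 0;
    for (; i + 3 <= n; i += 3) {   /* one test + branch per 3 iterations */
        a[i]     = 2.0 * b[i];
        a[i + 1] = 2.0 * b[i + 1];
        a[i + 2] = 2.0 * b[i + 2];
    }
    for (; i < n; i++)             /* residual iterations */
        a[i] = 2.0 * b[i];
}
```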
In software pipelining, consecutive loop iterations are executed as if they were stages of a hardware pipeline, as displayed in the following table. Consider cycles c+4, c+5, and c+6. These are the cycles that show the real advantage of software pipelining: in them, the available parallelism between subsequent loop iterations is fully exploited. For instance, in cycle c+4 the parallel operations are as follows −
Storing the result of iteration 1 (that is, a(1)) and auto-incrementing the index.
Decrementing the loop count, which is maintained in r200, by 1.
Performing the floating-point multiplication with the operands belonging to iteration 4, that is (2.0 * b(4)).
Loading the operand for iteration 5, that is b(5).
Most parallel execution of the given loop on an ILP-processor with multiple pipelined execution units
Cycles c+4 to c+6 show a repetitive pattern in the schedule. This pattern can be replaced by an equivalent loop, each iteration of which includes operations belonging to several different iterations of the original loop −
loop: store_i; decr_i+2; fmul_i+3; load_i+4; bc loop;  // the loop has to be executed for i = 1 to 3
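The same idea can be sketched in C for the illustrative computation a(i) = 2.0 * b(i), with an explicit prologue that fills the pipeline, a steady-state kernel in which each trip issues the store, multiply, and load belonging to three different original iterations (the operations a VLIW/ILP-processor could execute in parallel), and an epilogue that drains it. The function name and the pipeline depth of three stages are assumptions for illustration:

```c
void scale_swp(double *a, const double *b, int n) {
    if (n < 2) {                   /* too short to pipeline: plain loop */
        for (int i = 0; i < n; i++) a[i] = 2.0 * b[i];
        return;
    }

    double loaded  = b[0];         /* prologue: load for iteration 0     */
    double product = 2.0 * loaded; /* prologue: multiply for iteration 0 */
    loaded = b[1];                 /* prologue: load for iteration 1     */

    int i = 0;
    for (; i + 2 < n; i++) {       /* kernel: 3 iterations in flight     */
        a[i]    = product;         /* store for iteration i              */
        product = 2.0 * loaded;    /* multiply for iteration i+1         */
        loaded  = b[i + 2];        /* load for iteration i+2             */
    }

    a[i]     = product;            /* epilogue: drain the pipeline */
    a[i + 1] = 2.0 * loaded;
}
```

A real VLIW scheduler would pack the three kernel statements into one long instruction word; in C they merely appear as independent operations within one loop trip.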