The Queen's University of Belfast
Parallel Computer Centre


4 Loops


The CF77 compiler tries to vectorize loops that contain conditional statements and branches, subject to a series of rules.

4.1 IF loops and search loops

An IF loop uses an IF ... GOTO construct to control the repeated execution of a series of statements. A search loop is a DO loop that can be exited by means of an IF ... GOTO.

Both must satisfy the following requirements in order to be vectorized:

the loop must be executed the correct number of times (an early exit is any exit from the body of the loop that reduces the number of iterations below the trip count calculated for the loop)

the correct exit must be taken

all scalar values must be correct when an exit is taken

Once the requirements are met, the loops are vectorized as follows: the IF expression is evaluated for the full set of loop iterations indicated by the loop's DO statement; the point where the IF expression is satisfied gives the vector length to be used for the other expressions in the loop, and those expressions are executed with that length.

IF statements are vectorized as follows: a vector mask is set from the result of the condition; if no bits in the mask are set, that block of work is skipped; otherwise the elements corresponding to true conditions are gathered, the result is computed, and the result is scattered back into memory.
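The mask, gather, compute and scatter steps can be modelled in plain Fortran. The following is only a sketch of the idea, not the code CF77 generates. The routine name MSKDIV and the work arrays IDX and TMP are hypothetical, and the statement being modelled is IF (A(I) .NE. 0.0) B(I) = 1.0 / A(I).

```fortran
      SUBROUTINE MSKDIV(A, B, N, IDX, TMP)
!     Models the vectorization of
!        IF (A(I) .NE. 0.0) B(I) = 1.0 / A(I)
!     IDX and TMP are caller-supplied work arrays of length N.
      INTEGER N, I, K, NT
      INTEGER IDX(N)
      REAL A(N), B(N), TMP(N)
!     1. Evaluate the condition, recording where it is true.
      NT = 0
      DO I = 1, N
         IF (A(I) .NE. 0.0) THEN
            NT = NT + 1
            IDX(NT) = I
         END IF
      END DO
!     2. No element satisfies the condition: skip the work.
      IF (NT .EQ. 0) RETURN
!     3. Gather the selected elements into a dense vector.
      DO K = 1, NT
         TMP(K) = A(IDX(K))
      END DO
!     4. Compute on the dense vector.
      DO K = 1, NT
         TMP(K) = 1.0 / TMP(K)
      END DO
!     5. Scatter the results back to their original positions.
      DO K = 1, NT
         B(IDX(K)) = TMP(K)
      END DO
      END
```

Elements of B for which the condition is false are left untouched, exactly as in the scalar form.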

Changes may be made to the code so that it complies with the requirements for vectorization. The following points should be kept in mind in order to facilitate vectorization:

Code motion, ie moving early exits - the compiler can move a loop's final conditional exit to the top of the block containing the exit, so that the exit precedes the block's vectorizable statements and the number of iterations is known before those statements are executed.

Values computed following the exit - the condition for an early exit must not depend on values computed in the portion of the loop following the exit (ie in a previous iteration).

Indirect addressing - vectorization is inhibited when the exit condition involves indirect addressing. Because the condition is evaluated for the full set of iterations, index values beyond the exit point may lead to range errors.
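The two loop forms of this section can be sketched as follows. Both routines are illustrative only: the names DBLIF and SCALTO and their arguments are hypothetical, and whether a particular compiler vectorizes them is not guaranteed. The search loop is written so that its exit test comes first, uses direct addressing, and depends on nothing computed later in the loop body.

```fortran
      SUBROUTINE DBLIF(A, N)
!     IF loop: repetition is controlled by IF ... GOTO.
!     Single entrance and single exit, so it is equivalent to
!     DO I = 1, N.
      INTEGER N, I
      REAL A(N)
      I = 1
   10 IF (I .GT. N) GOTO 20
      A(I) = 2.0 * A(I)
      I = I + 1
      GOTO 10
   20 CONTINUE
      END

      SUBROUTINE SCALTO(A, B, N, KEY, IHIT)
!     Search loop: a DO loop left early through IF ... GOTO.
!     Doubles A into B up to the first element equal to KEY.
!     IHIT is the index of that element, or 0 if none is found.
      INTEGER N, IHIT, I
      REAL A(N), B(N), KEY
      IHIT = 0
      DO I = 1, N
         IF (A(I) .EQ. KEY) GOTO 30
         B(I) = 2.0 * A(I)
      END DO
      RETURN
   30 IHIT = I
      END
```

In SCALTO the position of the first match fixes how many elements of B are written, which corresponds to the vector length described above.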

4.2 Branches

Loops containing branches must comply with certain requirements if they are to be vectorized:

branches into a loop from outside prohibit vectorization

loops containing backward branches cannot be vectorized since a backward branch is itself a loop

loops with forward branches permit vectorization

arithmetic IFs, which are obsolete, are not vectorizable since they have multiple destinations
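A minimal sketch of a loop whose only branch is a forward branch, the case that permits vectorization. The routine name FWDBR is hypothetical. The branch target lies later in the same iteration, so no statement is ever repeated within an iteration.

```fortran
      SUBROUTINE FWDBR(A, B, N)
!     Loop containing a single forward branch.
      INTEGER N, I
      REAL A(N), B(N)
      DO I = 1, N
!        Skip the update when A(I) is not positive.
         IF (A(I) .LE. 0.0) GOTO 20
         B(I) = SQRT(A(I))
   20    CONTINUE
      END DO
      END
```

Had the GOTO pointed at a label earlier in the loop body, it would have been a backward branch and would have formed a loop of its own inside the DO loop.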

4.3 Transformations

The compiler's first phase, FPP, carries out transformations on the source code to remove some of the conditions which inhibit vectorization, eg:

converts IF loops to DO loops when they have a single entrance and a single exit

the compiler can analyse any combination of conditional assignments, conditional and unconditional forward branching, and block IFs. Compilation-speed restrictions mean there is a limit of 6 simultaneously active conditions

CF77 uses the CVMGT function to convert conditional reductions into a vectorizable form, provided the expression being reduced contains only +, -, *, count and logical operators.
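CVMGT(x1, x2, l) is a Cray intrinsic that returns x1 where l is true and x2 otherwise; the Fortran 90 intrinsic MERGE takes its arguments in the same order. The following sketch shows a conditional reduction rewritten in this merge form, using MERGE so that it compiles with other compilers. The function name POSSUM is hypothetical, and the exact code FPP produces is not shown in these notes.

```fortran
      REAL FUNCTION POSSUM(A, N)
!     Sum of the positive elements of A, in merge form.
!     Conditional form being replaced:
!        IF (A(I) .GT. 0.0) POSSUM = POSSUM + A(I)
      INTEGER N, I
      REAL A(N)
      POSSUM = 0.0
      DO I = 1, N
!        Add A(I) where the condition holds, 0.0 elsewhere.
         POSSUM = POSSUM + MERGE(A(I), 0.0, A(I) .GT. 0.0)
      END DO
      END
```

The IF has disappeared from the loop body, leaving an unconditional sum reduction.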

4.4 Special case - loops

Short Vector Loops - a fully vectorizable loop with an iteration count less than or equal to 64; the entire loop may be executed with vector instructions and no looping constructs

Reduction Loops - a reduction loop reduces an array to a scalar value by performing a cumulative operation on all of the array's elements; the expression in the current iteration includes the result of the previous iteration

Implied DO Loops - used in I/O statements to read or write data; the generated code calls a special entry point in the I/O library that vectorizes portions of the I/O operation. Although the I/O statement itself does not vectorize, the implied DO loop within the statement does.

Array references with a constant index - such a reference should be equivalent to a scalar reference and be treated as such, but it is not, as it may cause confusion in dependence analysis
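A reduction loop in its simplest form, sketched as a dot product. The function name DOTP is hypothetical. The running value DOTP appears on both sides of the assignment, which is the dependence on the previous iteration mentioned above.

```fortran
      REAL FUNCTION DOTP(X, Y, N)
!     Reduces the products X(I)*Y(I) to a single scalar.
      INTEGER N, I
      REAL X(N), Y(N)
      DOTP = 0.0
      DO I = 1, N
!        Uses the running value left by the previous iteration.
         DOTP = DOTP + X(I) * Y(I)
      END DO
      END
```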

4.5 Loop nest restructuring

CF77 examines all IF and DO loops within a nest of loops for possible optimization; if an outer loop proves a more suitable candidate for vectorization than the inner loop, it exchanges the two loops. Nest analysis examines the loops from the innermost outward, stopping when a nontranslatable construct is reached. The following is an example:

      DO I=1, N                         !N>>10
         DO J=1,10
            A(I,J) = B(I,J) * C(I,J)
         END DO
      END DO

Restructured (reordered) so that the longest vectorizable loop is the innermost loop:

      DO J=1,10
         DO I=1, N
            A(I,J) = B(I,J) * C(I,J)
         END DO
      END DO

CF77 selects the best loop for vectorization according to the following criteria:

loop iteration count

stride size of array references

percentage of data dependent code

percentage of conditionally executed code

4.6 Loop optimizations

CF77 performs loop optimizations in the following ways:

Loop collapse - converts a nest of loops into a single loop with a large iteration count

Loop fusion - combines consecutive loops with no statements between them

Example:

      DO I=1,N
         A(I) = B(I) + C(I)
      END DO
      DO I=2, N+1
         D(I) = E(I) **2
      END DO

Combined into:

      DO I=1,N
         A(I) = B(I) + C(I)
         D(I+1) = E(I+1) **2
      END DO

Source-level loop unrolling - replaces a loop by a copy of its body for every iteration, to be executed as straight-line code, eg

      DO I =1, N
         DO J=1, 4
            A(I,J) = B(I,J) * C(I,J)
         END DO
      END DO

Unrolled vertically:

      DO I =1, N
         A(I,1) = B(I,1) * C(I,1)
         A(I,2) = B(I,2) * C(I,2)
         A(I,3) = B(I,3) * C(I,3)
         A(I,4) = B(I,4) * C(I,4)
      END DO

Example:

      DO I =1, N
         T = 0
         DO J=1, 4
            T = T + B(I,J)
         END DO
         A(I) = A(I) + T
      END DO

Unrolled horizontally:

      DO I =1, N
         A(I) = A(I) + B(I,1)+B(I,2)+B(I,3)+B(I,4)
      END DO

Translation of array notation - array notation is translated into DO loops, which can then be vectorized

VFUNCTION directive use - this directive is recognised, and the function it names is treated as a vector intrinsic
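Loop collapse, the first item in the list above, can be pictured with a pair of routines. This is a sketch only, assuming the arrays are stored contiguously so that an N by M array can be treated as a vector of N*M elements; ADD2D and ADD1D are hypothetical names.

```fortran
      SUBROUTINE ADD2D(A, B, C, N, M)
!     Original nest: M inner loops of length N each.
      INTEGER N, M, I, J
      REAL A(N,M), B(N,M), C(N,M)
      DO J = 1, M
         DO I = 1, N
            A(I,J) = B(I,J) + C(I,J)
         END DO
      END DO
      END

      SUBROUTINE ADD1D(A, B, C, NM)
!     Collapsed form: one loop of length NM = N*M.
      INTEGER NM, K
      REAL A(NM), B(NM), C(NM)
      DO K = 1, NM
         A(K) = B(K) + C(K)
      END DO
      END
```

Fortran stores arrays in column-major order, so element (I,J) of the N by M array is element I+(J-1)*N of the vector, and the two routines perform the same additions.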

4.7 Other common optimization techniques

Many of the common optimization techniques are performed by FPP, but the developer can aid this process. The aim is to give the compiler as large a fraction as possible of clearly vectorizable loops, ie minimize branching (IFs) within loops, place DO loops inside subroutines rather than subroutine calls inside loops, etc.

Loop splitting - split a non-vectorizing loop into separate loops, one containing the non-vectorizable statements and one containing the vectorizable statements, as in the next example:

      DO I=2, N
         A(I) = A(I-1) * D(I)
         B(I) = A(I) + C(I-1) **2
      END DO

Loop split version:

      DO I=2, N
         A(I) = A(I-1) * D(I)           !RECURRENCE, SCALAR LOOP
      END DO
      DO I=2, N
         B(I) = A(I) + C(I-1) **2       !VECTOR LOOP
      END DO

Promoting scalar to vector - scalar recurrences may be avoided by using a temporary vector, eg

      DO J=1, M
         S = BB
         DO I=1, N
            S = S*C
            A(I) = A(I) + S
         END DO
      END DO

Modified:

      DO I=1, M
         TV(I) = BB
      END DO
      DO I=1, N
         DO J=1, M
            TV(J) = TV(J) * C
            A(I) = A(I) + TV(J)
         END DO
      END DO

Using the PARAMETER statement - this can improve compiler optimization by providing more compile-time information about loop lengths and potential data dependencies

Example, with S an integer variable whose value is not known at compile time:

      DO I=2, N
         A(I-1) = A(I-S) * B(I)         !CONDITIONAL VECTOR LOOP
      END DO

Modified:

      INTEGER S
      PARAMETER (S=1)
      ...
      DO I=2, N
         A(I-1) = A(I-S) * B(I)         !VECTOR LOOP
      END DO

IF block ordering - place the most frequently executed conditions first in block IFs

Example:

      DO I=1, N
         IF (A(I) .EQ. 0) THEN
            B(I) = AZERO                !RARE
         ELSEIF (A(I) .LT. 0) THEN
            B(I) = ANEG                 !RARE
         ELSE
            B(I) = C(I) * D(I)/A(I)
         ENDIF
      END DO

Modified:

      DO I=1, N
         IF (A(I) .GT. 0) THEN
            B(I) = C(I) * D(I)/A(I)     !MOST FREQUENT
         ELSEIF (A(I) .EQ. 0) THEN
            B(I) = AZERO
         ELSE
            B(I) = ANEG
         ENDIF
      END DO

Subroutine in-lining - subroutine and function calls inside a loop prevent vectorization. The loop can be vectorized by bringing the code from the subroutine or function in-line. This can be done manually, by the compiler, or, for functions, by using a statement function.
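A minimal sketch of the statement-function route. The routine name VNORM and the statement function SQSUM are hypothetical. A statement function is defined inside the program unit that uses it, so the compiler can expand it in line at each reference instead of generating a call.

```fortran
      SUBROUTINE VNORM(X, Y, R, N)
!     R(I) = SQRT(X(I)**2 + Y(I)**2) with no external call.
      INTEGER N, I
      REAL X(N), Y(N), R(N)
      REAL SQSUM, U, V
!     Statement function: U and V are its dummy arguments.
      SQSUM(U, V) = U*U + V*V
      DO I = 1, N
         R(I) = SQRT(SQSUM(X(I), Y(I)))
      END DO
      END
```

SQRT is an intrinsic function, so the loop body contains no reference to an external routine.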

4.8 Compiler vectorization directives

FPP directives are lines in the source code beginning with the string CFPP$; they are user directives, used to give FPP more information about your program. Compiler directives are lines beginning with CDIR$ or CDIR@; they are inserted automatically by FPP and are interpreted by the compiler as information about the program.

FPP user directives have the following syntax:

CFPP$ directive scope

Scope takes one of the values R (routine), L (loop), F (file) or I (immediate); if left blank, the directive applies to the next loop.

Some of the FPP directives (prefixed with CFPP$) are:

VECTOR/NOVECTOR - turns vectorization on or off

CONCUR/NOCONCUR - enables/disables autotasking

NODEPCHK - ignores potential data dependencies

NEXTSCALAR - disables vectorization for the next DO loop found between the directive and the end of the program unit

Other source code directives (prefixed with CDIR$) are:

IVDEP (SAFEVL = n) - indicates that any dependencies can be ignored if the vector length does not exceed n; it should only be used when it is known that any apparent dependencies will not cause invalid results if the loop is vectorized

VFUNCTION - declares that a vector version of an external function exists

SHORTLOOP - allows the compiler to generate faster code when a loop's trip count is 64 or less

VSEARCH/NOVSEARCH - VSEARCH, the default, enables vectorization of all search loops until a NOVSEARCH is encountered

RECURRENCE/NORECURRENCE - RECURRENCE enables vectorization of all reduction loops until a NORECURRENCE is encountered or the end of the program unit is reached; NORECURRENCE disables it
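A sketch of how an IVDEP directive might be placed, written in fixed-form source, where a line with C in column 1 is a comment and is therefore ignored by compilers that do not recognise the directive. The routine name SHIFTM is hypothetical, and the assumption that K is never negative is the caller's responsibility: with K negative the loop is a true recurrence and the directive would produce wrong results.

```fortran
      SUBROUTINE SHIFTM(A, B, N, K)
!     A(I) = A(I+K) * B(I) for I = 1..N; A must hold N+K values.
!     Assumption: K .GE. 0, so each A(I+K) is read before any
!     later iteration overwrites it. The compiler cannot see the
!     sign of K, hence the directive.
      INTEGER N, K, I
      REAL A(*), B(N)
CDIR$ IVDEP
      DO I = 1, N
         A(I) = A(I+K) * B(I)
      END DO
      END
```

The directive sits immediately before the loop it refers to.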


All documents are the responsibility of, and copyright, © their authors and do not represent the views of The Parallel Computer Centre, nor of The Queen's University of Belfast.
Maintained by Alan Rea, email A.Rea@qub.ac.uk