CRAY Y-MP EL VECTOR PROCESSING, QUB

The Queen's University of Belfast

Parallel Computer Centre

VECTOR PROCESSING

Special Case of SISD
Simple but powerful technique (speedups of around x10) that relies on pipelining.
A SCALAR is a single value, so scalar processing consists of performing operations on scalars one at a time.
A VECTOR is a series of values eg a Fortran array or a sub-section of it.
Some operations, such as ADD, consist of multiple steps.
VECTORIZATION allows these internal steps to proceed simultaneously thus processing as many elements as the number of steps in the operation.

Consider a simple illustration - building chairs from kits.
A four stage process - one time step each.

Vector Multiply

The multiply process is made up of a number of stages.
Vector processors have hardware which allows these stages to work independently and pass results to each other in an "assembly line" manner.
Special registers known as VECTOR registers are used to hold data for a vector operation. These are "loaded" before a vector operation takes place.

Consider the following piece of Fortran:

DO 10 I=1,100

A(I) = B(I) * C(I)

10 CONTINUE

If this calculation is performed in serial mode each multiplication will take 4 clock cycles. The whole loop will take MORE than 400 clock cycles.
If this calculation were performed in vector mode the following would take place:
- A series of elements from B would be loaded into one vector register and a series of elements from C would be loaded into a second vector register (these can be done simultaneously)
- The first pair of elements, one from B and one from C, would be fed into the multiply pipeline. As they progress along the pipeline more elements are fed in. After 4 clock cycles the first result will be produced. This is stored in a third vector register ie A.
- The values in the third vector register are stored in the main memory at the location of array A.
- Time for the multiply alone is 103 clock cycles (4 for the first result plus one for each of the remaining 99).

Pseudo Code

VLOAD B VR1

VLOAD C VR2

VMULT VR1 VR2 VR3

VSTORE VR3 A

Chaining

The results of one vector operation may be fed into another. Multiple vector operations can be chained into one long operation - changing the ORDER of processing.

load instructions may be overlapped
once the first elements of B and C are available the multiply operation may start
when the first results are ready from the multiply operation then the Store instruction may begin
for long vectors all 4 operations are executing in parallel

CRAY Y-MP Vector Processor - example timing:

Vector Load : 17 clock cycles

Vector Multiply : 4 clock cycles

Vector Store : 17 clock cycles

Total : 38 clock cycles

The first result would be produced after 38 clock cycles and one further result would be produced on every clock cycle after that. So the total number of cycles required would be:
38 + 99 = 137
which is much less than the scalar time of more than 400 clock cycles.
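
The cycle counts quoted above can be reproduced with a very small model. This is only a sketch: it assumes one result per clock once a pipeline is full, uses the latencies given in the text (4 cycles for the multiply, 17 for a load or store), and the function names are illustrative rather than anything from the Cray documentation.

```python
# Rough cycle-count model for A(I) = B(I) * C(I) over 100 elements.
# Latencies are the ones quoted in the text; names are illustrative only.

def scalar_cycles(n, op_latency=4):
    """Each element waits for the whole operation before the next starts."""
    return n * op_latency

def pipelined_cycles(n, startup):
    """First result after `startup` cycles, then one result per cycle."""
    return startup + (n - 1)

n = 100
print(scalar_cycles(n))                    # 400
print(pipelined_cycles(n, 4))              # 103  (multiply alone)
print(pipelined_cycles(n, 17 + 4 + 17))    # 137  (load, multiply, store chained)
```

The model ignores loop overhead in the scalar case, which is why the text says the scalar loop takes more than 400 cycles.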

Vector Hardware

Vector functional units

pipelined, fully segmented
each stage of the pipeline performs a step of the function on different operands
once the pipeline is full, a new result is produced each clock period (cp)

Vector registers and pipeline

Pipelining and Chaining

a pipeline can segment the arithmetic operation ie passing the output of one stage to the next stage as input
the size of a pipeline can be increased by chaining
chaining means the result from a pipeline can be used as an operand in a second pipeline

S(I) = A * X(I) + Y(I)

Comparison - vector and scalar operation

A scalar operation works on only one pair of operands from the S registers and returns the result to another S register.
A vector operation can work on 64 pairs of operands to produce 64 results while executing only one instruction.
Vector overhead is larger than scalar overhead - the vector length must be computed to find out how many 64 element segments are needed.
A vector register holds up to 64 words - longer vectors are processed in 64 element segments.
Performance drops at each point where the vector length spills over into a new segment - avoid situations where the number of elements to be processed exceeds the register capacity by a small amount eg a vector length of 65.
A Cray vector register can receive a result and retransmit it as an operand to a subsequent operation in one clock period.
A register may be both a result and an operand register, allowing two or more vector operations to be chained together.
Chaining can therefore produce 2 or more results per clock cycle.
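
To see why a vector length of 65 is awkward, the loop can be split into register-sized segments. The sketch below assumes the 64-word register length given above; the helper name and the choice to put the short segment last are illustrative assumptions, not a description of what the compiler actually emits.

```python
VECTOR_REGISTER_LENGTH = 64  # words per vector register, as stated in the text

def segments(n, vl=VECTOR_REGISTER_LENGTH):
    """Split a loop of n iterations into register-sized pieces."""
    full, rest = divmod(n, vl)
    return [vl] * full + ([rest] if rest else [])

for n in (64, 65, 128, 256):
    print(n, segments(n))
```

A length of 65 needs two segments, the second holding a single element, so the vector startup cost is paid twice for almost no extra work.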

Scalar and Vector Processing Example

DO 10 I = 1, 256

JJ(I) = KK(I)+LL(I)

10 CONTINUE

Both methods of processing must produce the same results for each array element.

Scalar Processing

Read one element of Fortran array KK
Read one element of LL
Add the two elements
Write the result to the Fortran array JJ
Increment the loop index by 1
Repeat the above sequence for each succeeding array element until the loop index equals its limit.

Vector Processing

Load a series of elements from array KK to a vector register and a series of elements from array LL to another vector register (these operations occur simultaneously except for instruction issue time)
Add the corresponding elements from the two vector registers and send the results to another vector register, representing array JJ
Store the register used for array JJ to memory
Repeat this sequence if the array has more elements than the maximum number a vector register can hold ie 64.

Scalar and Vector Processing Example

DO 10 I = 1, 3

L(I) = J(I)+K(I)

N(I) = L(I)+M(I)

10 CONTINUE

Both methods of processing must produce the same results for each array element.

Scalar Processing Order and Results

The two statements are each executed three times, with the operations alternating:

L(I) is calculated before N(I) in each iteration; L(I) is then used to calculate the value of N(I).

Results of Scalar processing

Event  Operation            Values
1      L(1) = J(1)+K(1)     7 = 2 + 5
2      N(1) = L(1)+M(1)     11 = 7 + 4
3      L(2) = J(2)+K(2)     -1 = (-4) + 3
4      N(2) = L(2)+M(2)     5 = (-1) + 6
5      L(3) = J(3)+K(3)     15 = 7 + 8
6      N(3) = L(3)+M(3)     13 = 15 + (-2)

Vector Processing Order and Results

Vector processing changes the order in which operations are performed on individual array elements. The first line within the loop processes all elements of the array before the second line is executed.

Results of Vector processing

Event  Operation            Values
1      L(1) = J(1)+K(1)     7 = 2 + 5
2      L(2) = J(2)+K(2)     -1 = (-4) + 3
3      L(3) = J(3)+K(3)     15 = 7 + 8
4      N(1) = L(1)+M(1)     11 = 7 + 4
5      N(2) = L(2)+M(2)     5 = (-1) + 6
6      N(3) = L(3)+M(3)     13 = 15 + (-2)
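
The two orderings in the tables can be replayed directly. The sketch below uses the values of J, K and M read off the tables and simply checks that both orders give the same L and N; the variable names with `_s` and `_v` suffixes are illustrative.

```python
J = [2, -4, 7]
K = [5, 3, 8]
M = [4, 6, -2]
n = len(J)

# Scalar order: both statements run for I=1, then for I=2, and so on.
L_s, N_s = [0] * n, [0] * n
for i in range(n):
    L_s[i] = J[i] + K[i]
    N_s[i] = L_s[i] + M[i]

# Vector order: the first statement for every I, then the second statement.
L_v = [J[i] + K[i] for i in range(n)]
N_v = [L_v[i] + M[i] for i in range(n)]

print(L_s, N_s)                    # [7, -1, 15] [11, 5, 13]
print((L_s, N_s) == (L_v, N_v))    # True
```

The results agree because N(I) only needs L(I) from the same iteration, so the reordering is safe for this loop.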

Vector Performance

This is governed by a number of factors eg

Architecture (machine dependent factors)
- relative scalar to vector instruction speed
- pipeline length
- pipeline startup cost

Software (dependent factors)
- vector length ie using optimal vector length
- vector code fraction ie increasing this

Terminology

Vector - a series of values on which instructions operate eg array or array subsets (ie columns, rows, diagonals where the intervals between the array elements locations are constant)
Vector Length - number of elements in a vector; a vector register holds a maximum of 64
Stride - interval between memory locations for successive elements of a vector, vectorization requires a constant stride ie an interval that is the same for all consecutive elements of a vector
Dependency - anything causing scalar and vector code to give different results, a recurrence or data dependency is an expression within a loop requiring a value calculated in a previous iteration
Chime - a sequence of vector operations that can be chained with a single vector load and store
Constant increment integer - scalar integer variable whose value changes a fixed amount on every iteration of a loop
Indirect-address vector - an array which is referenced by an index array
Vector array reference - an array which is referenced within a loop by a constant increment integer
Invariant array elements - array references which do not change within a loop
Loop induction variable - constant loop increment integer used as array reference
Invariant expression - a constant, variable, or expression which does not change within a loop
Scalar temporaries - scalar variable set equal to a vectorizable expression and used later in the loop on the right hand side of an assignment statement
Vectorizable expression - arithmetic or logical expression consisting of any combination of; invariant expressions, loop induction variables, vector array references, scalar temporaries, intrinsic functions
Vectorizable loop - innermost loop, contains only vectorizable expressions

Terminology Examples

Example 1

DO 30 I=1, 1000

A(I) = B(I) * 5.0

30 CONTINUE

B = vector array reference

I = constant increment integer

Example 2

DO 40 I =1, N

A(I) = B(INDX(I)) + C(K)

40 CONTINUE

INDX = indirect-address vector

C(K) = invariant array element

Example 3

DO 50 I =1, N

TMP = F(I) - 32.0

C(I) = TFVAP/TCVAP * TMP

50 CONTINUE

I = loop induction variable

TFVAP/TCVAP = invariant expression

TMP = scalar temporary

General Requirements for vectorization

DO loops, DO WHILE loops and IF loops can be vectorized if they fulfil the following criteria:-

In a nested structure, only the innermost loops can be vectorized
Vector and scalar versions of code must give equivalent results
Other requirements are based on hardware considerations and limits on the complexity of the loop.

Conditions inhibiting vectorization

In an innermost loop inhibitors are:

references to external code
- I/O statements (except in an implied DO loop),
- functions which do not have vector versions
- external subroutines or functions which are not expanded inline
- any RETURN, STOP, or PAUSE statements as these generate library calls
Obsolete conditional statements
- arithmetic IF
- assigned GOTO and computed GOTO
Backward branches other than the one which forms the loop.
Source directives NOVECTOR, NEXTSCALAR or SUPPRESS.
A statement branch into the loop from outside the loop.
Array bounds checking.
Dependencies ie constructs producing different results in scalar and vector mode eg recurrences and ambiguous subscript references.

Vectorization inhibitors

Subroutine calls

DO I = 1, N

CALL STEP1(A,B,C) !SCALAR

D(I) = A(K) *B(K) + CONST

END DO

External function calls (intrinsic OK)

DO I = 1, N

A(I) = ADD(B(I), C(I)) !SCALAR

END DO

DO I = 1, N

A(I) = SIN(B(I)) !VECTOR

END DO

Character variables

CHARACTER*80 ABC(100)

...

DO I = 1, 100

ABC(I) = 'Eschew Obscuration' !SCALAR

END DO

Input/Output statements

DO I = 1, N

READ(20) A(I) !SCALAR

END DO

READ(20) (A(I), I=1, 100) !vector version

Assigned GOTO statements

DO 220 I = 1, N

GOTO IJK (200,210) !SCALAR

200 CONTINUE

X(I) = A(I)

GOTO 220

210 CONTINUE

X(I) = B(I)

220 CONTINUE

Arithmetic IF

DO 290 I =1, N

IF (IJK(I)) 260, 270, 280 !SCALAR

260 CONTINUE

A(I) = 0.

GOTO 290

270 CONTINUE

B(I) = 0.

GOTO 290

280 CONTINUE

C(I) = 0.

290 CONTINUE

Recurrence (data dependency)

DO I=2,N

A(I) = A(I-1) * X(I) !scalar

END DO

Recurrence - an expression in one iteration of a loop requires a value that was defined by an expression in a previous iteration of that loop

ALPHA(1) = A(1)

DO 60 I=2, N

ALPHA(I) = A(I) - (B(I)*C(I-1))/ALPHA(I-1)

60 CONTINUE

Branching due to IF or GOTO statements may inhibit or degrade vectorization

DO I = 1, N

IF (X(I) .LE.ALIM) THEN

A(I) = B(I) * 2.0 + C(I)

ELSE

A(I) = D(I) * 5.0 + E(I)

END IF

END DO

Memory contention and optimization

Vectorization performance can be degraded by memory contention ie conflict between successive accesses to the same memory bank.

Contention can arise when elements are accessed consecutively along a row eg

REAL A(256,100), B(256,100)

...

DO I=1, 256

DO J=1, 100 ! Accesses rows - stride 256

A(I,J) = B(I,J) * 2

END DO

END DO

Memory contention

Efficient memory access can be ensured by writing loops so that:
- elements are accessed down columns (incrementing the first subscript on consecutive elements);
- vectors to be loaded/stored have odd strides. An odd stride guarantees that successive accesses are not made to the same memory bank.

Example

REAL A(256,100), B(256,100)

DO J=1, 100

DO I=1, 256 ! Accesses columns - stride 1

A(I,J) = B(I,J) * 2

END DO

END DO

The following example retains memory accesses along rows but with an odd stride of 257, because each column length is now 257:

REAL A(257,100),B(257,100)

* Column length 257

...

DO I=1, 256

DO J=1, 100

A(I,J) = B(I,J) * 2

END DO

END DO
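
The effect of padding the column length from 256 to 257 can be illustrated with a toy model of interleaved memory. This is a sketch under stated assumptions: consecutive words are taken to live in consecutive banks, and the bank count of 64 is an assumed power of two, not a figure from the text.

```python
NBANKS = 64  # assumed bank count (a power of two); not taken from the text

def banks_touched(stride, accesses, nbanks=NBANKS):
    """Number of distinct banks hit by `accesses` successive loads
    whose word addresses are `stride` apart."""
    return len({(k * stride) % nbanks for k in range(accesses)})

for stride in (1, 256, 257):
    print(stride, banks_touched(stride, accesses=100))
```

With these assumptions a stride of 256 sends every access to the same bank, while strides of 1 and 257 spread the 100 accesses over all 64 banks.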

STRIDE EXAMPLES

Stride 1 Accessing columns

DO 300 J = 1, 3

DO 300 I = 1, 4

MATRIX(I,J) = MATRIX(I,J)+1.0

300 CONTINUE

Stride 4 Accessing rows

DO 310 I=1, 4

DO 310 J =1, 3

MATRIX(I,J) = MATRIX(I,J)+1.0

310 CONTINUE

Stride 5 Accessing diagonally

DO 320 I=1, 4

MATRIX(I,I) = MATRIX(I,I)+1.0

320 CONTINUE

Layout of a 2D Array in Memory
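
Fortran stores arrays in column-major order, which is where the strides 1, 4 and 5 in the examples come from. The sketch below assumes MATRIX has 4 rows (the text does not declare it, but the row stride of 4 implies this) and computes zero-based word offsets; the helper names are illustrative.

```python
ROWS = 4  # assumed leading dimension of MATRIX, implied by the row stride of 4

def offset(i, j, rows=ROWS):
    """Zero-based memory offset of MATRIX(i, j), 1-based subscripts,
    column-major storage."""
    return (i - 1) + (j - 1) * rows

def stride(first, second):
    """Distance in words between two elements given as (i, j) pairs."""
    return offset(*second) - offset(*first)

print(stride((1, 1), (2, 1)))   # 1  moving down a column
print(stride((1, 1), (1, 2)))   # 4  moving along a row
print(stride((1, 1), (2, 2)))   # 5  moving along the diagonal
```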

Amdahl's Law for vectorization

Amdahl's law states that the performance of a program is dominated by its slowest component, which in the case of a vectorized program is scalar code.

Amdahl's Law - the formulation for vector code which is Rv times faster than scalar code is:

sv = 1 / (fs + fv/Rv)

where

sv = maximum expected speedup

fv = fraction of a program that is vectorized

fs = fraction of a program that is scalar = 1 - fv

Rv = ratio of scalar to vector processing time

For Cray Research systems Rv ranges from 10 to 20.

It is common for programs to be 70% to 80% vectorized ie 70% to 80% of their running time is spent executing vector instructions.
It is not always easy to reach 70% to 80% vectorization in a program, and vectorizing beyond this level becomes increasingly difficult, normally requiring major changes to the algorithm.
Many users stop their vectorization efforts once the vectorized code is running 2 to 4 times faster than scalar code.
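
The 2 to 4 times figure follows from the numbers above. The sketch assumes the usual form of Amdahl's law, sv = 1 / (fs + fv/Rv), with the symbols as defined in this section; the function name is illustrative.

```python
def vector_speedup(fv, rv):
    """Maximum expected speedup for vectorized fraction fv and
    scalar-to-vector time ratio rv, assuming sv = 1 / (fs + fv/rv)."""
    fs = 1.0 - fv
    return 1.0 / (fs + fv / rv)

for fv in (0.7, 0.8, 0.9):
    for rv in (10, 20):
        print(f"fv={fv:.1f}  Rv={rv:2d}  speedup={vector_speedup(fv, rv):.2f}")
```

Even with Rv = 20, a program that is 80% vectorized gains only a little over a factor of 4, because the remaining scalar 20% dominates the run time.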

Dependencies

A dependency is a primary inhibitor of vectorization occurring when scalar and vector processing give different results.
A recurrence is a data dependency between loop iterations occurring when one loop iteration requires a value that was defined in a previous iteration. Eg

DO I =2, 4

IB(I) = IA(I-1)

IA(I) = IC(I)

ENDDO

This code contains a recurrence caused by the subscript (I-1) hence the vector version would produce incorrect results.

Scalar Processing Order and Results

The two statements are each executed three times, with the operations alternating:

Results of Scalar processing

Event  Operation          Values
1      IB(2) = IA(2-1)    3
2      IA(2) = IC(2)      4
3      IB(3) = IA(3-1)    4
4      IA(3) = IC(3)      6
5      IB(4) = IA(4-1)    6
6      IA(4) = IC(4)      8

Vector Processing Order and Results

Vector processing changes the order in which operations are performed on individual array elements. The first line within the loop processes all elements of the array before the second line is executed.

Results of Vector processing

Event  Operation          Values
1      IB(2) = IA(2-1)    3
2      IB(3) = IA(3-1)    6
3      IB(4) = IA(4-1)    9
4      IA(2) = IC(2)      4
5      IA(3) = IC(3)      6
6      IA(4) = IC(4)      8
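
The difference between the two tables can be reproduced by running the loop in both orders. In this sketch the starting values of IA(1) to IA(3) and the values of IC are the ones implied by the tables; IA(4) = 12 is an assumed filler value that is never read. Index 0 is unused so that the subscripts match the Fortran.

```python
IA0 = [None, 3, 6, 9, 12]        # IA(1..4); IA(4) is an assumed filler value
IC = [None, None, 4, 6, 8]       # IC(2..4)

def scalar(ia, ic):
    ia, ib = ia[:], [None] * 5
    for i in range(2, 5):        # statements alternate, one iteration at a time
        ib[i] = ia[i - 1]
        ia[i] = ic[i]
    return ib[2:], ia[2:]

def vector(ia, ic):
    ia, ib = ia[:], [None] * 5
    ib[2:5] = ia[1:4]            # first statement for I = 2..4
    ia[2:5] = ic[2:5]            # second statement for I = 2..4
    return ib[2:], ia[2:]

print(scalar(IA0, IC))   # ([3, 4, 6], [4, 6, 8])
print(vector(IA0, IC))   # ([3, 6, 9], [4, 6, 8])
```

IA ends up the same either way, but IB differs because the vector order reads IA(2) and IA(3) before the second statement has updated them.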

Testing for dependency

It is necessary to determine whether two appearances of an array in a loop can create a dependency conflict.

Example

DO 20 J = 2, M

Z(J) = YY(J) + TEMPA

R(J) = Z(J + 1)/TEMPB

20 CONTINUE

Array Z is both defined and referenced within the loop ie one appearance is the key definition, where array Z is defined, and the other appearance is the other reference, where array Z is referenced.

Test the loop using the key definition and the other reference as follows:

determine whether the other reference is in the previous or the subsequent area eg Z(J) is the key definition and Z(J+1), the other reference, is subsequent to the key definition;
determine if the subscript of the other reference is greater than or less than the subscript of the key definition eg the subscript of Z(J+1) is greater than the subscript of Z(J);
determine whether the array subscripts are incrementing or decrementing on each iteration of the loop eg J is incrementing on each iteration.

The use of an array in a loop has the following characteristics:
- An array's other reference is either Previous or Subsequent to its key definition
- The subscript on the other reference is either Greater or Less than on the key definition
- The array's subscript is either Incrementing or Decrementing
These characteristics can be abbreviated to summarize a total of 8 possibilities for loop-dependency analysis.
4 cases indicate dependencies which inhibit vectorization - SLD, SGI, PLI, PGD.
4 cases do not inhibit vectorization - SGD, SLI, PLD, PGI.
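
The eight cases follow a simple pattern, sketched below. The rule used is one reading of the lists above rather than a quoted compiler algorithm. A Subsequent reference conflicts when its subscript runs ahead of the key definition in the direction of the loop (G with I, or L with D). A Previous reference conflicts in the opposite situation. The function name is illustrative.

```python
def inhibits(position, subscript, direction):
    """position: 'S' or 'P'; subscript: 'G' or 'L'; direction: 'I' or 'D'."""
    ahead_of_loop = (subscript, direction) in {("G", "I"), ("L", "D")}
    return ahead_of_loop if position == "S" else not ahead_of_loop

cases = [p + s + d for p in "SP" for s in "GL" for d in "ID"]
print(sorted(c for c in cases if inhibits(*c)))       # ['PGD', 'PLI', 'SGI', 'SLD']
print(sorted(c for c in cases if not inhibits(*c)))   # ['PGI', 'PLD', 'SGD', 'SLI']
```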

Examples which inhibit vectorization

SUBROUTINE SGI(A, B, C)

DO 10 I = 1, 99

A(I) = B(I)

C(I) = A(I+1)

10 CONTINUE

END

SUBROUTINE SLD(A, B, C)

DO 10 I = 100, 2, -1

A(I) = B(I)

C(I) = A(I-1)

10 CONTINUE

END

SUBROUTINE PLI(A, B, C)

DO 10 I = 2, 100

B(I) = A(I-1)

A(I) = C(I)

10 CONTINUE

END

SUBROUTINE PGD(A, B, C)

DO 10 I = 99, 1, -1

B(I) = A(I+1)

A(I) = C(I)

10 CONTINUE

END

Examples which do not inhibit

SUBROUTINE SGD(A, B, C)

DO 10 I = 99, 1, -1

A(I) = B(I)

C(I) = A(I+1)

10 CONTINUE

END

SUBROUTINE SLI(A, B, C)

DO 10 I = 2, 100

A(I) = B(I)

C(I) = A(I-1)

10 CONTINUE

END

SUBROUTINE PLD(A, B, C)

DO 10 I = 100, 2, -1

B(I) = A(I-1)

A(I) = C(I)

10 CONTINUE

END

SUBROUTINE PGI(A, B, C)

DO 10 I = 1, 99

B(I) = A(I+1)

A(I) = C(I)

10 CONTINUE

END

Rigorous testing for dependency

A more rigorous test takes into account the stride of the indexes of arrays:-

DO 20 J = 2, M, 2

Z(J) = YY(J) + TEMPA

R(J) = Z(J+1) / TEMPB

20 CONTINUE

Z is both defined and referenced in the loop
Z(J) is the key definition and Z(J+1) is the other reference
Take the earlier of the two references in the loop body as ref1 and the later as ref2. Define index1 as the index of ref1 eg the index of Z(J) is J, index2 as the index of ref2 eg the index of Z(J+1) is J+1, and stride as the loop increment, which here is 2.
If the sign of index2 minus index1 equals the sign of stride there may be a dependency so proceed to the next step, otherwise no dependency exists eg index1 = J, index2 = J+1, stride = 2
The sign of index2 - index1 = the sign of ((J+1)-(J)) = the sign of (1), which is positive
The sign of stride = the sign of 2, which is positive
There may be a dependency in the example because the sign of index2 minus index1 equals the sign of stride (both positive) so it is necessary to do the next step.
If ((index2 minus index1) mod stride) equals 0 there is a dependency, otherwise no dependency exists, for example:

index1 = J, index2 = J+1, stride = 2:

(index2-index1) mod stride =

((J+1)-(J)) mod 2 = 1

There is no data dependency as the remainder does not equal 0.
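
The two-step test can be written as a small function. This is a sketch of the procedure as described in this section, taking the difference index2 - index1 and the loop stride as plain integers; the function names are illustrative.

```python
def sign(x):
    return (x > 0) - (x < 0)

def has_dependency(index_diff, stride):
    """index_diff is index2 - index1, eg (J+1) - J = 1."""
    if sign(index_diff) != sign(stride):
        return False                    # step 1: signs differ, no dependency
    return index_diff % stride == 0     # step 2: remainder test

print(has_dependency(1, 2))    # False: the example above
print(has_dependency(1, 1))    # True:  the same loop with a stride of 1
print(has_dependency(2, 2))    # True
print(has_dependency(-1, 1))   # False
```

With a stride of 2 the loop defines only even elements of Z and references only odd ones, which is why the remainder test reports no dependency.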

Vectorizing recurrences

Prevent the recurrence from affecting the result.
The threshold of a recurrence is the number of iterations that occur before a computed value is used again. If the vector length does not exceed the threshold, the recurrence does not affect the results of the vectorized version eg a recurrence whose threshold is 64 is fully vectorized.
If the compiler can detect a threshold value in the range 2 < k < 64, the loop is vectorized with a vector length of k eg a threshold of 6 is vectorized with a vector length of 6.

SUBROUTINE SHORT_VL(A)

DIMENSION A(100)

DO 20 I = 7, 100

A(I) = A(I-6) + 1.0

20 CONTINUE

END
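
The effect of the threshold can be checked by running the SHORT_VL loop in segments of different lengths. This is a sketch only: the starting values of A are arbitrary, each segment is modelled as "load everything, then store everything", and the function names are illustrative.

```python
def scalar_version(n=100, dist=6):
    a = [float(i) for i in range(n + 1)]     # a[0] unused; arbitrary start values
    for i in range(dist + 1, n + 1):
        a[i] = a[i - dist] + 1.0
    return a

def vector_version(vl, n=100, dist=6):
    a = [float(i) for i in range(n + 1)]
    for start in range(dist + 1, n + 1, vl):
        stop = min(start + vl, n + 1)
        loaded = [a[i - dist] for i in range(start, stop)]   # vector load
        a[start:stop] = [x + 1.0 for x in loaded]            # add, then store
    return a

print(vector_version(6) == scalar_version())    # True
print(vector_version(64) == scalar_version())   # False
```

With a vector length of 6 every segment reads only values that earlier segments have already stored, so the recurrence never shows up inside a segment.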

SAFE VECTOR LENGTH

Compiler can include a run-time test to determine a safe vector length
- safe length is less than or equal to the recurrence threshold
- threshold value need not be known at compile time ie equals k if k<64 otherwise 64
- Use of safe vector length can degrade performance considerably by allowing lengths of 1 and 2

PROGRAM SAFEVAL

DIMENSION A(-100:100), B(100), C(100)

K = KFUN(A(I))

N = NFUN(B(I)) ! unknown at compile time

DO I = K, N ! N is vector length

A(I) = A(I) + (I-K)

END DO

SUBROUTINE RUNTIME

DIMENSION A(-100:100), B(100), C(100)

COMMON // J

DO I = 1, 100

C(I) = A(I-J)

A(I) = B(I)

END DO

END

if J >= 64 then vector length = 64
if J < 1 then vector length = 64
if 1<= J<64 then vector length = J
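
The rule for the RUNTIME subroutine can be checked against a direct simulation. The sketch below assumes the three-way rule just listed, models each vector segment as running the first statement for the whole segment before the second, and uses arbitrary starting values; all names are illustrative.

```python
MAX_VL = 64  # vector register length from the text

def safe_vector_length(j, max_vl=MAX_VL):
    """Rule quoted above for C(I) = A(I-J); A(I) = B(I)."""
    return j if 1 <= j < max_vl else max_vl

def run(j, vl, n=100):
    """Execute the loop in segments of vl iterations; vl=1 is scalar order."""
    a = {i: float(i) for i in range(-n, 2 * n + 1)}   # arbitrary start values
    b = {i: 1000.0 + i for i in range(1, n + 1)}
    c = {}
    for start in range(1, n + 1, vl):
        idx = range(start, min(start + vl, n + 1))
        loaded = [a[i - j] for i in idx]   # first statement, whole segment
        c.update(zip(idx, loaded))
        for i in idx:                      # second statement, whole segment
            a[i] = b[i]
    return c

for j in (-3, 0, 1, 10, 63, 64, 100):
    vl = safe_vector_length(j)
    print(j, vl, run(j, vl) == run(j, 1))
print(run(10, 64) == run(10, 1))   # False: length 64 is unsafe when J = 10
```

Each line of the loop output ends in True, showing that the chosen length reproduces the scalar results for that value of J.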

Data dependency directives

Data dependency directives can be used to provide the compiler, CF77, with additional information so that code can be fully optimized.
- CFPP$ NODEPCHK directs the compiler to ignore potential data dependencies in a loop but is only safe to use when absolutely sure that no recurrence exists
- CFPP$ NOEQCHK directs the compiler to examine equivalence statements for recurrences
- CFPP$ RELATION used to provide additional information about array subscript ranges thereby determining whether or not a loop is safe to vectorize

Loops

IF loops and search loops - both must satisfy the following requirements in order to be vectorized:
- the loop must be executed the correct number of times - no early exits decrease the number of loop iterations from the trip count
- the correct exit must be taken
- all scalar values must be correct on exit
- vectorization - the IF expression is evaluated for the full set of loop iterations indicated by the loop's DO statement; the point where the IF expression is satisfied gives the vector length to be used when the other expressions in the loop are executed.
- vectorization - IF statements - a vector mask is set from the result of the condition; if no bits in the vector mask are set that block of work is skipped, otherwise the elements corresponding to true conditions are gathered, the result is computed and the result is scattered back into memory.
The code may be changed to comply with the requirements for vectorization eg
- Code motion ie moving early exits - so the exit precedes the block's vectorizable statements and the number of iterations are known before the vectorizable statements are executed.
- Values computed following exit - condition for an early exit must not depend on values computed (ie in a previous iteration) in the portion of the loop following the exit.
- Indirect addressing - this inhibits vectorization when the exit condition involves indirect addressing as this may lead to range errors.
Branches - loops containing branches must comply with certain requirements to be vectorized
- branches into a loop from outside prohibit vectorization
- loops containing backward branches cannot be vectorized since a backward branch is itself a loop
- loops with forward branches permit vectorization
- arithmetic IFs, which are obsolete, are not vectorizable since they have multiple destinations
Transformation - compiler's first phase FPP removes vectorization inhibitors:
- converts IF loops to DO loops when they have a single entrance and a single exit
- compiler can analyse any combination of conditional assignments, conditional and unconditional forward branching, and block IFs.
Loop nest restructuring
- CF77 examines IF and DO loops within a nest of loops for possible optimization
- Examines loops from innermost loops outward until a nontranslatable construct is reached.

DO I=1, N !N>>10

DO J=1,10

A(I,J) = B(I,J) * C(I,J)

END DO

END DO

Loop nest restructuring
- Restructured or reordered so that the longest vectorizable loop is the innermost loop

DO J=1,10

DO I=1, N

A(I,J) = B(I,J) * C(I,J)

END DO

END DO

Loop optimizations

Loop collapse - converts a nest of loops into a single loop with a large iteration count
Loop fusion - combines consecutive loops with no statements between them

DO I=1,N

A(I) = B(I) + C(I)

END DO

DO I=2, N+1

D(I) = E(I) **2

END DO

Loop fusion

Combined into

DO I=1,N

A(I) = B(I) + C(I)

D(I+1) = E(I+1) **2

END DO
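
A quick way to convince oneself that the fused loop is equivalent is to run both forms on the same data. The sketch below uses arbitrary test values and a small N; index 0 is unused so that the subscripts match the Fortran, and the function names are illustrative.

```python
N = 8
B = [float(i) for i in range(N + 2)]     # arbitrary test data
C = [2.0 * i for i in range(N + 2)]
E = [0.5 * i for i in range(N + 2)]

def separate():
    A, D = [0.0] * (N + 2), [0.0] * (N + 2)
    for i in range(1, N + 1):            # DO I = 1, N
        A[i] = B[i] + C[i]
    for i in range(2, N + 2):            # DO I = 2, N+1
        D[i] = E[i] ** 2
    return A, D

def fused():
    A, D = [0.0] * (N + 2), [0.0] * (N + 2)
    for i in range(1, N + 1):            # single loop, second body shifted by one
        A[i] = B[i] + C[i]
        D[i + 1] = E[i + 1] ** 2
    return A, D

print(separate() == fused())   # True
```

The shift from D(I) to D(I+1) is what lets the two loops share the same index range.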

Source-level loop unrolling - makes a copy of the loop body for every iteration to be executed eg

DO I =1, N

DO J=1, 4

A(I,J) = B(I,J) * C(I,J)

END DO

END DO

DO I =1, N !Unrolled - vertically

A(I,1) = B(I,1) * C(I,1)

A(I,2) = B(I,2) * C(I,2)

A(I,3) = B(I,3) * C(I,3)

A(I,4) = B(I,4) * C(I,4)

END DO

Source-level loop unrolling

DO I =1, N

T = 0

DO J=1, 4

T = T + B(I,J)

END DO

A(I) = A(I) + T

END DO

DO I =1, N !Unrolled horizontally

A(I) = A(I) + B(I,1)+B(I,2)+B(I,3)+B(I,4)

END DO

Common optimization techniques

Loop splitting - split a non-vectorizing loop into separate loops, one containing the vectorizable statements and one containing the non-vectorizable statements

DO I=2, N

A(I) = A(I-1) * D(I)

B(I) = A(I) + C(I-1) **2

END DO

DO I=2, N ! Modified

A(I) = A(I-1) * D(I)

END DO

DO I=2, N

B(I) = A(I) + C(I-1) **2

END DO

Subroutine in-lining - subroutine and function calls inside a loop prevent vectorization. The loop can be vectorized by bringing the code from the subroutine or function in-line. This can be done manually, using the compiler, or for functions by using a statement function.
Promoting scalar to vector - scalar recurrences may be avoided using a temporary vector eg

DO J=1, M

S = BB

DO I=1, N

S = S*C

A(I) = A(I) + S

END DO

END DO

Promoting scalar to vector modified code

DO I=1, M

TV(I) = BB

END DO

DO I=1, N

DO J=1, M

TV(J) = TV(J) * C

A(I) = A(I) + TV(J)

END DO

END DO
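
The promotion of S to the temporary vector TV can be checked in the same way. The sketch below mirrors the two Fortran fragments with zero-based Python lists; the test values for A, BB, C and M are arbitrary and the function names are illustrative.

```python
def original(a, bb, c, m):
    a = a[:]
    for _ in range(m):               # DO J = 1, M
        s = bb
        for i in range(len(a)):      # DO I = 1, N  (S carries a recurrence on I)
            s = s * c
            a[i] = a[i] + s
    return a

def promoted(a, bb, c, m):
    a = a[:]
    tv = [bb] * m                    # TV(J) = BB
    for i in range(len(a)):          # DO I = 1, N
        for j in range(m):           # DO J = 1, M  (no recurrence on J)
            tv[j] = tv[j] * c
            a[i] = a[i] + tv[j]
    return a

a0 = [1.0, 2.0, 3.0]
print(original(a0, 2.0, 0.5, 4) == promoted(a0, 2.0, 0.5, 4))   # True
```

After the change the inner loop runs over J, and each TV(J) depends only on its own previous value from the outer iteration, so the inner loop no longer carries the scalar recurrence.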

Using the PARAMETER statement - can improve compiler optimization by providing more compile time information about loop lengths and potential data dependencies eg

DO I=2, N

A(I-1) = A(I-S) * B(I) !CONDITIONAL

END DO

Modified

PARAMETER (S=1)

...

DO I=2, N

A(I-1) = A(I-S) * B(I) !VECTOR

END DO

IF block ordering - placing the most frequently executed conditions first in block IFs eg

DO I=1, N

IF (A(I) .EQ. 0) THEN

B(I) = AZERO !Rare

ELSEIF (A(I) .LT. 0) THEN

B(I) = ANEG !Rare

ELSE

B(I) = C(I) * D(I)/A(I)

ENDIF

END DO

IF re-ordering modified code

DO I=1, N

IF (A(I) .GT. 0) THEN

B(I) = C(I) * D(I)/A(I) !Frequent

ELSEIF (A(I) .EQ. 0) THEN

B(I) = AZERO

ELSE

B(I) = ANEG

ENDIF

END DO

Compiler vectorization directives

VECTOR/NOVECTOR Turns vectorization on or off

NEXTSCALAR Disables vectorization for the next DO loop between the directive and the end of the program unit

IVDEP (SAFEVL = n) Indicates that any dependencies can be ignored if the vector length does not exceed n - only used when it is known that any apparent dependencies will not cause invalid results if a loop is vectorized

VFUNCTION Declares that a vector version of an external function exists

SHORTLOOP Allows the compiler to generate faster code when a loop's trip count is 64 or less

VSEARCH/NOVSEARCH Default is VSEARCH as this enables vectorization of all search loops until a NOVSEARCH is encountered

RECURRENCE/NORECURRENCE Enables or disables vectorization of all reduction loops, again until a NORECURRENCE is encountered or the end of the program unit is reached

All documents are the responsibility of, and copyright, © their authors and do not represent the views of The Parallel Computer Centre, nor of The Queen's University of Belfast.
Maintained by Alan Rea, email A.Rea@qub.ac.uk
