The Queen's University of Belfast

Parallel Computer Centre
VECTOR PROCESSING
- Special Case of SISD
- Simple but powerful technique (speed-ups of around x10). It relies on a technique known as pipelining.
- A SCALAR is a single value and therefore Scalar processing consists of performing operations on scalars one at a time.
- A VECTOR is a series of values eg a Fortran array or a sub-section of it.
- Some operations, such as ADD, consist of multiple steps.
- VECTORIZATION allows these internal steps to proceed simultaneously on different elements, so that as many elements are in progress at once as there are steps in the operation.
- Consider a simple illustration - building chairs from kits.
- A four stage process - one time step each.


Vector Multiply
- The multiply process is made up of a number of stages.

- Vector processors have hardware which allows these stages to work independently and pass results to each other in an "assembly line" manner.
- Special registers known as VECTOR registers are used to hold data for a vector operation. These are "loaded" before a vector operation takes place.
- Consider the following piece of Fortran:
DO 10 I=1,100
A(I) = B(I) * C(I)
10 CONTINUE
- If this calculation is performed in serial mode each multiplication will take 4 clock cycles. The whole loop will take MORE than 400 clock cycles, because loads, stores and loop control add to the 4 x 100 cycles of arithmetic.
- If this calculation was performed in vector mode the following would take place:
- A series of elements from B would be loaded into one vector register and a series of elements from C would be loaded into a second vector register (these can be done simultaneously)
- The first elements of B and C are fed into the multiply pipeline. As they progress along the pipeline more elements are fed in. After 4 clock cycles the first result is produced, and one further result follows on each subsequent clock cycle. The results are stored in a third vector register ie A.
- The values in the third vector register are stored in the main memory at the location of array A.
- Time for the multiply alone: 4 + 99 = 103 clock cycles.
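The figure of 103 clock cycles can be checked with a small cycle-by-cycle model. This is only a sketch in Python, not Cray code: the names pipelined_cycles and serial_cycles are hypothetical, and the model assumes one element enters the pipeline per clock cycle and spends exactly one cycle in each of the 4 stages.

```python
def pipelined_cycles(n_elements, n_stages):
    """Count clock cycles for n_elements to pass through an n_stages pipeline.

    Assumption: one new element enters per cycle and each element
    spends exactly one cycle in every stage.
    """
    pipeline = [None] * n_stages      # pipeline[k] holds the element in stage k
    fed = done = cycles = 0
    while done < n_elements:
        # Everything already in the pipeline advances by one stage.
        pipeline = [None] + pipeline[:-1]
        # The next element, if any remain, enters the first stage.
        if fed < n_elements:
            pipeline[0] = fed
            fed += 1
        cycles += 1
        # An element in the last stage completes during this cycle.
        if pipeline[-1] is not None:
            done += 1
    return cycles


def serial_cycles(n_elements, n_stages):
    """Without pipelining each element occupies all stages before the next starts."""
    return n_elements * n_stages


print(serial_cycles(100, 4))      # 400
print(pipelined_cycles(100, 4))   # 103
```

The pipelined count is always n_stages + (n_elements - 1): the start-up cost is paid once and every later element costs a single cycle.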
Pseudo Code
VLOAD B VR1
VLOAD C VR2
VMULT VR1 VR2 VR3
VSTORE VR3 A
Chaining
The results of one vector operation may be fed into another. Multiple vector operations can be chained into one long operation - changing the ORDER of processing.
- load instructions may be overlapped
- once the first elements of B and C are available the multiply operation may start
- when the first results are ready from the multiply operation then the Store instruction may begin
- for long vectors all 4 operations are executing in parallel
- CRAY YMP Vector Processor - example timing:
Vector Load : 17 clock cycles
Vector Multiply : 4 clock cycles
Vector Store : 17 clock cycles
Total : 38 clock cycles
- The first result would be produced after 38 clock cycles and one would be produced on every clock cycle after that. So the total number of cycles required would be:
- 38 + 99 = 137
- which is much faster than the scalar time of more than 400 clock cycles
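The chained timing can be reproduced with a few lines of arithmetic. The sketch below is in Python and uses the start-up figures quoted above; the function names are hypothetical, and the unchained case is a simplified assumption (each operation processes all elements before the next one starts), not a measured Cray figure.

```python
def chained_cycles(n_elements, startup_cycles):
    """Start-up of every chained operation, then one result per clock cycle."""
    return sum(startup_cycles) + (n_elements - 1)


def unchained_cycles(n_elements, startup_cycles):
    """Simplified model: each operation finishes all elements before the next starts."""
    return sum(startup + (n_elements - 1) for startup in startup_cycles)


YMP_STARTUP = [17, 4, 17]   # load, multiply, store (figures quoted above)

print(chained_cycles(100, YMP_STARTUP))     # 137
print(unchained_cycles(100, YMP_STARTUP))   # 335
```

Under these assumptions chaining pays the 99 extra cycles once rather than once per operation.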
Vector Hardware
Vector functional units
- pipelined, fully segmented
- each stage of the pipeline performs a step of the function on different operands
- once the pipeline is full, a new result is produced each clock period (cp)

Vector registers and pipeline
Pipelining and Chaining
- a pipeline can segment the arithmetic operation ie passing the output of one stage to the next stage as input
- the size of a pipeline can be increased by chaining
- chaining means the result from a pipeline can be used as an operand in a second pipeline
S(I) = A * X(I) + Y(I)

Comparison - vector and scalar operation
- A scalar operation works on only one pair of operands from the S register and returns the result to another S register.
- A vector operation can work on 64 pairs of operands together to produce 64 results executing only one instruction.

Vector overhead is larger than scalar - the vector length must be computed to find out how many vector register loads (segments) are needed.
- A vector register holds up to 64 words - processed in 64 element segments
- Decreases in performance at each point in which the vector length spills over into a new register - avoid situations where the number of elements to be processed exceeds the register capacity by a small amount eg vector length of 65.
- Cray can receive a result by a vector register and retransmit it as an operand to a subsequent operation in one clock period
- A register may be both a result and an operand register allowing the chaining of two or more vector operations together.
- Chaining can therefore deliver 2 or more results per clock cycle.
Scalar and Vector Processing Example
DO 10 I = 1, 256
JJ(I) = KK(I)+LL(I)
10 CONTINUE
Both methods of processing must produce the same results for each array element.
Scalar Processing
- Read one element of Fortran array KK
- Read one element of LL
- Add the two elements
- Write the result to the Fortran array JJ
- Increment the loop index by 1
- Repeat the above sequence for each succeeding array element until the loop index equals its limit.
Vector Processing
- Load a series of elements from array KK to a vector register and a series of elements from array LL to another vector register (these operations occur simultaneously except for instruction issue time)
- Add the corresponding elements from the two vector registers and send the results to another vector register, representing array JJ
- Store the register used for array JJ to memory
- Repeat this sequence if the array has more elements than the maximum number of elements used in vector processing ie 64.
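The vector steps above can be sketched in Python as follows. This is a model only: Python lists stand in for the vector registers, the names strip_mine and vector_add are hypothetical, and MAX_VL = 64 is the register capacity quoted in the text.

```python
MAX_VL = 64   # vector register capacity quoted in the text


def strip_mine(n_elements, max_vl=MAX_VL):
    """Split n_elements into (start, length) segments of at most max_vl elements."""
    return [(start, min(max_vl, n_elements - start))
            for start in range(0, n_elements, max_vl)]


def vector_add(kk, ll, max_vl=MAX_VL):
    """Model of JJ(I) = KK(I) + LL(I) processed one register-load at a time."""
    jj = [0] * len(kk)
    for start, length in strip_mine(len(kk), max_vl):
        vr1 = kk[start:start + length]              # vector load of a KK segment
        vr2 = ll[start:start + length]              # vector load of an LL segment
        vr3 = [a + b for a, b in zip(vr1, vr2)]     # vector add
        jj[start:start + length] = vr3              # vector store into JJ
    return jj


print(len(strip_mine(256)))   # 4 segments of 64 elements
print(strip_mine(65))         # [(0, 64), (64, 1)]
```

The second call shows why a vector length of 65 is awkward: a whole extra segment is needed for a single element.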
Scalar and Vector Processing Example
DO 10 I = 1, 3
L(I) = J(I)+K(I)
N(I) = L(I)+M(I)
10 CONTINUE
Both methods of processing must produce the same results for each array element.
Scalar Processing Order and Results
- The two statements are each executed three times, with the operations alternating;
L(I) is calculated before N(I) in each iteration, and L(I) is then used to calculate the value of N(I).
Results of Scalar processing
Event  Operation         Values
1      L(1) = J(1)+K(1)  7 = 2 + 5
2      N(1) = L(1)+M(1)  11 = 7 + 4
3      L(2) = J(2)+K(2)  -1 = (-4) + 3
4      N(2) = L(2)+M(2)  5 = (-1) + 6
5      L(3) = J(3)+K(3)  15 = 7 + 8
6      N(3) = L(3)+M(3)  13 = 15 + (-2)
Vector Processing Order and Results
- Vector processing changes the order in which operations are performed on individual array elements. The first line within the loop processes all elements of the array before the second line is executed.
Results of Vector processing
Event  Operation         Values
1      L(1) = J(1)+K(1)  7 = 2 + 5
2      L(2) = J(2)+K(2)  -1 = (-4) + 3
3      L(3) = J(3)+K(3)  15 = 7 + 8
4      N(1) = L(1)+M(1)  11 = 7 + 4
5      N(2) = L(2)+M(2)  5 = (-1) + 6
6      N(3) = L(3)+M(3)  13 = 15 + (-2)
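The two orderings can be replayed in Python to confirm that they agree for this loop. The sketch uses the values of J, K and M shown in the tables; the function names scalar_order and vector_order are hypothetical.

```python
def scalar_order(j, k, m):
    """Statements alternate: L(I) then N(I) for each I in turn."""
    size = len(j)
    l, n = [0] * size, [0] * size
    for i in range(size):
        l[i] = j[i] + k[i]
        n[i] = l[i] + m[i]
    return l, n


def vector_order(j, k, m):
    """First statement for all elements, then second statement for all elements."""
    l = [a + b for a, b in zip(j, k)]
    n = [a + b for a, b in zip(l, m)]
    return l, n


J = [2, -4, 7]
K = [5, 3, 8]
M = [4, 6, -2]

print(scalar_order(J, K, M))   # ([7, -1, 15], [11, 5, 13])
print(vector_order(J, K, M))   # ([7, -1, 15], [11, 5, 13])
```

The results match because N(I) only uses the L(I) computed for the same I, so the change of order is harmless here.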
Vector Performance
This is governed by a number of factors eg
- Architecture (machine dependent factors)
- relative scalar to vector instruction speed
- pipeline length
- pipeline startup cost
- Software (dependent factors)
- vector length ie using optimal vector length
- vector code fraction ie increasing this
Terminology
- Vector - a series of values on which instructions operate eg array or array subsets (ie columns, rows, diagonals where the intervals between the array elements locations are constant)
- Vector Length - number of elements in a vector; at most 64 elements are held in a vector register at one time
- Stride - interval between memory locations for successive elements of a vector, vectorization requires a constant stride ie an interval that is the same for all consecutive elements of a vector
- Dependency - anything causing scalar and vector code to give different results, a recurrence or data dependency is an expression within a loop requiring a value calculated in a previous iteration
- Chime - a sequence of vector operations that can be chained with a single vector load and store
- Constant increment integer - scalar integer variable whose value changes a fixed amount on every iteration of a loop
- Indirect-address vector - an array which is referenced by an index array
- Vector array reference - an array which is referenced within a loop by a constant increment integer
- Invariant array elements - array references which do not change within a loop
- Loop induction variable - constant loop increment integer used as array reference
- Invariant expression - a constant, variable, or expression which does not change within a loop
- Scalar temporaries - scalar variable set equal to a vectorizable expression and used later in the loop on the right hand side of an assignment statement
- Vectorizable expression - arithmetic or logical expression consisting of any combination of; invariant expressions, loop induction variables, vector array references, scalar temporaries, intrinsic functions
- Vectorizable loop - innermost loop, contains only vectorizable expressions
Terminology Examples
DO 30 I=1, 1000
A(I) = B(I) * 5.0
30 CONTINUE
B = vector array reference
I = constant increment integer
DO 40 I =1, N
A(I) = B(INDX(I)) + C(K)
40 CONTINUE
INDX = indirect-address vector
K = invariant array element
DO 50 I =1, N
TMP = F(I) - 32.0
C(I) = TFVAP/TCVAP * TMP
50 CONTINUE
I = loop induction variable
TFVAP/TCVAP = invariant expression
TMP = scalar temporary
General Requirements for vectorization
DO loops, DO WHILE loops and IF loops can be vectorized if they fulfil the following criteria:-
- In a nested structure, only the innermost loops can be vectorized
- Vector and scalar versions of code must give equivalent results
- Other requirements are based on hardware considerations and limits on the complexity of the loop.
Conditions inhibiting vectorization
In an innermost loop inhibitors are:
- references to external code
- I/O statements (except in an implied DO loop),
- functions which do not have vector versions
- external subroutines or functions which are not expanded inline
- any RETURN, STOP, or PAUSE statements as these generate library calls
- Obsolete conditional statements
- arithmetic IF
- assigned GOTO and computed GOTO
- Backward branches other than the one which forms the loop.
- Source directives NOVECTOR, NEXTSCALAR or SUPPRESS.
- A statement branch into the loop from outside the loop.
- Array bounds checking.
- Dependencies ie constructs producing different results in scalar and vector mode eg recurrences and ambiguous subscript references.
Vectorization inhibitors
DO I = 1, N
CALL STEP1(A,B,C) !SCALAR
D(I) = A(K) *B(K) + CONST
END DO
- External function calls (intrinsic OK)
DO I = 1, N
A(I) = ADD(B(I), C(I)) !SCALAR
END DO
DO I = 1, N
A(I) = SIN(B(I)) !VECTOR
END DO
CHARACTER*80 ABC(100)
...
DO I = 1, 100
ABC(I) = 'Eschew Obscuration' !scalar
END DO
DO I = 1, N
READ(20) A(I) !SCALAR
END DO
READ(2) (A(I), I=1, 100) !vector version
DO 220 I = 1, N
GOTO IJK (200,210) !SCALAR
200 CONTINUE
X(I) = A(I)
GOTO 220
210 CONTINUE
X(I) = B(I)
220 CONTINUE
DO 290 I =1, N
IF (IJK(I)) 260, 270, 280 !SCALAR
260 CONTINUE
A(I) = 0.
GOTO 290
270 CONTINUE
B(I) = 0.
GOTO 290
280 CONTINUE
C(I) = 0.
290 CONTINUE
- Recurrence (data dependency)
DO I=2,N
A(I) = A(I-1) * X(I) !scalar
END DO
- Recurrence - an expression in one iteration of a loop requires a value that was defined by an expression in a previous iteration of that loop
ALPHA(1) = A(1)
DO 60 I=2, N
ALPHA(I) = A(I) - (B(I)*C(I-1))/ALPHA(I-1)
60 CONTINUE
- Branching due to IF or GOTO statements may inhibit or degrade vectorization
DO I = 1, N
IF (X(I) .LE.ALIM) THEN
A(I) = B(I) * 2.0 + C(I)
ELSE
A(I) = D(I) * 5.0 + E(I)
END IF
END DO
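One way a block IF of this kind can still be handled in vector mode is with a vector mask: both branches are evaluated for every element and the mask selects which result is kept. The Python sketch below models that idea only; it is not what any particular compiler generates, and the function names are hypothetical. Computing both branches is one reason why branching can degrade vector performance.

```python
def branch_scalar(x, alim, b, c, d, e):
    """Element-by-element version of the loop with the block IF."""
    a = []
    for i in range(len(x)):
        if x[i] <= alim:
            a.append(b[i] * 2.0 + c[i])
        else:
            a.append(d[i] * 5.0 + e[i])
    return a


def branch_masked(x, alim, b, c, d, e):
    """Masked version: evaluate both branches, then merge under the mask."""
    mask = [xi <= alim for xi in x]                          # vector compare
    then_vals = [bi * 2.0 + ci for bi, ci in zip(b, c)]      # THEN branch, all elements
    else_vals = [di * 5.0 + ei for di, ei in zip(d, e)]      # ELSE branch, all elements
    return [t if m else f for m, t, f in zip(mask, then_vals, else_vals)]


x = [1.0, 5.0, 3.0]
b, c = [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]
d, e = [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]
print(branch_scalar(x, 3.0, b, c, d, e))   # [12.0, 5.0, 36.0]
print(branch_masked(x, 3.0, b, c, d, e))   # [12.0, 5.0, 36.0]
```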
Memory contention and optimization
Vectorization performance can be degraded by memory contention ie conflict between successive accesses to the same memory bank.
Accessing elements consecutively in a row eg
REAL A(256,100), B(256,100)
...
DO I=1, 256
DO J=1, 100 ! Accesses rows - stride 256
A(I,J) = B(I,J) * 2
END DO
END DO
Memory contention
- Efficient memory access can be ensured by writing loops;
- accessing elements down columns (incrementing the first subscript on consecutive elements);
- vectors to be loaded/stored have odd strides. An odd stride guarantees that successive accesses are not made to the same memory bank.
Example
REAL A(256,100), B(256,100)
DO J=1, 100
DO I=1, 256 ! Accesses columns - stride 1
A(I,J) = B(I,J) * 2
END DO
END DO
The following example retains memory accesses along rows, but the stride is now 257 because each column length is 257. This stride is odd, so successive accesses are not made to the same memory bank:
REAL A(257,100),B(257,100)
* Column length 257
...
DO I=1, 256
DO J=1, 100
A(I,J) = B(I,J) * 2
END DO
END DO
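The effect of the stride on memory banks can be illustrated with a small Python model. The sketch assumes word-interleaved memory, ie word address w lives in bank w mod n_banks, and a bank count of 64; both are assumptions for illustration and not a statement about any particular Cray system. The function name banks_touched is hypothetical.

```python
def banks_touched(stride, n_accesses, n_banks):
    """Banks visited by n_accesses words spaced `stride` words apart.

    Assumption: word-interleaved memory, bank = address mod n_banks.
    """
    return [(k * stride) % n_banks for k in range(n_accesses)]


N_BANKS = 64   # assumed power-of-two bank count, for illustration only

for stride in (1, 256, 257):
    distinct = len(set(banks_touched(stride, 100, N_BANKS)))
    print(stride, distinct)
# 1 64
# 256 1
# 257 64
```

With a stride of 256 every access lands in the same bank, whereas the odd stride of 257 spreads the accesses over all banks exactly as a stride of 1 does.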
STRIDE EXAMPLES
- Stride 1 Accessing columns
DO 300 J = 1, 3
DO 300 I = 1, 4
MATRIX(I,J) = MATRIX(I,J)+1.0
300 CONTINUE
- Stride 4 Accessing rows (column length 4)
DO 310 I=1, 4
DO 310 J =1, 3
MATRIX(I,J) = MATRIX(I,J)+1.0
310 CONTINUE
- Stride 5 Accessing diagonally
DO 320 I=1, 4
MATRIX(I,I) = MATRIX(I,I)+1.0
320 CONTINUE
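The strides quoted above follow from Fortran's column-major storage. The Python sketch below computes them, assuming MATRIX is dimensioned with 4 rows (the declaration is not shown in the text, so this is an assumption); the names offset and strides are hypothetical.

```python
def offset(i, j, n_rows):
    """Zero-based memory offset of MATRIX(i, j), 1-based subscripts, column-major."""
    return (i - 1) + (j - 1) * n_rows


def strides(refs, n_rows):
    """Differences between the offsets of successive (i, j) references."""
    offs = [offset(i, j, n_rows) for i, j in refs]
    return [later - earlier for earlier, later in zip(offs, offs[1:])]


N_ROWS = 4   # assumed leading dimension of MATRIX

column = [(i, 1) for i in range(1, 5)]      # down a column
row = [(1, j) for j in range(1, 4)]         # along a row
diagonal = [(i, i) for i in range(1, 5)]    # along the diagonal

print(strides(column, N_ROWS))     # [1, 1, 1]
print(strides(row, N_ROWS))        # [4, 4]
print(strides(diagonal, N_ROWS))   # [5, 5, 5]
```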
Layout of a 2D Array in Memory

Amdahl's Law for vectorization
- Amdahl's law states that the performance of a program is dominated by its slowest component, which in the case of a vectorized program is scalar code.
Amdahl's Law - the formulation for vector code which is Rv times faster than scalar code is:
sv = 1 / (fs + fv/Rv)
sv = maximum expected speedup
fv = fraction of a program that is vectorized
fs = fraction of a program that is scalar = 1 - fv
Rv = ratio of scalar to vector processing time
For Cray Research systems Rv ranges from 10 to 20.
- It is common for programs to be 70% to 80% vectorized ie 70% to 80% of their running time is spent executing vector instructions.
- Not always easy to reach 70% to 80% vectorization in a program and vectorizing beyond this level becomes increasingly difficult normally requiring major changes to the algorithm.
- Many users stop their vectorization efforts once the vectorized code is running 2 to 4 times faster than scalar code.
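The figures above can be checked numerically. The Python sketch below uses the usual form of Amdahl's law, sv = 1 / (fs + fv/Rv) with fs = 1 - fv, and evaluates it for the ranges quoted in the text; the function name amdahl_speedup is hypothetical.

```python
def amdahl_speedup(fv, rv):
    """Maximum expected speedup for vectorized fraction fv and speed ratio rv."""
    fs = 1.0 - fv                  # scalar fraction
    return 1.0 / (fs + fv / rv)


for fv in (0.7, 0.8):
    for rv in (10, 20):
        print(f"fv={fv:.1f} Rv={rv:2d} speedup={amdahl_speedup(fv, rv):.2f}")
```

For 70% to 80% vectorization and Rv between 10 and 20 the speedup lies roughly between 2.7 and 4.2, which is consistent with users stopping at 2 to 4 times the scalar speed. Even with an infinitely fast vector unit, 80% vectorization cannot give more than a factor of 5.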
Dependencies
- A dependency is a primary inhibitor of vectorization occurring when scalar and vector processing give different results.
- A recurrence is a data dependency between loop iterations occurring when one loop iteration requires a value that was defined in a previous iteration. Eg
DO I =2, 4
IB(I) = IA(I-1)
IA(I) = IC(I)
ENDDO
- This code contains a recurrence caused by the subscript (I-1) hence the vector version would produce incorrect results.
Scalar Processing Order and Results
- The two statements are each executed three times, with the operations alternating.
Results of Scalar processing
Event  Operation        Value
1      IB(2) = IA(2-1)  3
2      IA(2) = IC(2)    4
3      IB(3) = IA(3-1)  4
4      IA(3) = IC(3)    6
5      IB(4) = IA(4-1)  6
6      IA(4) = IC(4)    8
Vector Processing Order and Results
- Vector processing changes the order in which operations are performed on individual array elements. The first line within the loop processes all elements of the array before the second line is executed.
Results of Vector processing
Event  Operation        Value
1      IB(2) = IA(2-1)  3
2      IB(3) = IA(3-1)  6
3      IB(4) = IA(4-1)  9
4      IA(2) = IC(2)    4
5      IA(3) = IC(3)    6
6      IA(4) = IC(4)    8
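The two tables can be reproduced with a short Python model. The initial values IA(1:3) = 3, 6, 9 and IC(2:4) = 4, 6, 8 are read off the tables; IA(4) = 12 and IC(1) = 2 never influence the result and are assumed here only to fill the arrays. The function names are hypothetical, and list position 0 corresponds to Fortran subscript 1.

```python
def run_scalar(ia, ic):
    """Scalar order: IB(I) = IA(I-1) then IA(I) = IC(I), for I = 2..4 in turn."""
    ia = list(ia)
    ib = [None] * len(ia)
    for i in range(1, len(ia)):    # zero-based 1..3 is Fortran I = 2..4
        ib[i] = ia[i - 1]
        ia[i] = ic[i]
    return ib, ia


def run_vector(ia, ic):
    """Vector order: all IB assignments first, using the old IA values."""
    ia = list(ia)
    ib = [None] + ia[:-1]          # IB(2:4) = IA(1:3), old values
    ia[1:] = ic[1:]                # IA(2:4) = IC(2:4)
    return ib, ia


IA = [3, 6, 9, 12]   # IA(4) = 12 is an assumed filler value
IC = [2, 4, 6, 8]    # IC(1) = 2 is an assumed filler value

print(run_scalar(IA, IC)[0])   # [None, 3, 4, 6]
print(run_vector(IA, IC)[0])   # [None, 3, 6, 9]
```

IB(3) and IB(4) differ between the two orders, which is exactly the dependency that prevents vectorization of this loop.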
Testing for dependency
- It is necessary to determine whether two appearances of an array in a loop can create a dependency conflict
Example
DO 20 J = 2, M
Z(J) = YY(J) + TEMPA
R(J) = Z(J + 1)/TEMPB
20 CONTINUE
Array Z is both defined and referenced within the loop: the appearance where Z is defined is the key definition, and the appearance where Z is only read is the other reference.
Test the loop using the key definition and the other reference as follows:
- determine whether the other reference is in the previous or the subsequent area eg Z(J) is the key definition and Z(J+1), the other reference is subsequent to the key definition;

- determine if the subscript of the other reference is greater than or less than the subscript of the key definition eg Z(J+1) is greater than the subscript for Z(J);
- determine whether the array subscripts are incrementing or decrementing on each iteration of the loop eg J is incrementing on each iteration.
- The use of an array in a loop has the following characteristics:
- An array's other reference is either Previous or Subsequent to its key definition
- The subscript on the other reference is either Greater or Less than on the key definition
- The array's subscript is either Incrementing or Decrementing
- These characteristics can be abbreviated to summarize a total of 8 possibilities for loop-dependency analysis.
- 4 cases indicate dependencies which inhibit vectorization - SLD, SGI, PLI, PGD.
- 4 cases indicate cases which do not inhibit vectorization - SGD, SLI, PLD, PGI.
Examples which inhibit vectorization
SUBROUTINE SGI(A, B, C)
DO 10 I = 1, 99
A(I) = B(I)
C(I) = A(I+1)
10 CONTINUE
END
SUBROUTINE SLD(A, B, C)
DO 10 I = 100, 2, -1
A(I) = B(I)
C(I) = A(I-1)
10 CONTINUE
END
SUBROUTINE PLI(A, B, C)
DO 10 I = 2, 100
B(I) = A(I-1)
A(I) = C(I)
10 CONTINUE
END
SUBROUTINE PGD(A, B, C)
DO 10 I = 99, 1, -1
B(I) = A(I+1)
A(I) = C(I)
10 CONTINUE
END
Examples which do not inhibit
SUBROUTINE SGD(A, B, C)
DO 10 I = 99, 1, -1
A(I) = B(I)
C(I) = A(I+1)
10 CONTINUE
END
SUBROUTINE SLI(A, B, C)
DO 10 I = 2, 100
A(I) = B(I)
C(I) = A(I-1)
10 CONTINUE
END
SUBROUTINE PLD(A, B, C)
DO 10 I = 100, 2, -1
B(I) = A(I-1)
A(I) = C(I)
10 CONTINUE
END
SUBROUTINE PGI(A, B, C)
DO 10 I = 1, 99
B(I) = A(I+1)
A(I) = C(I)
10 CONTINUE
END
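The classification of the eight cases can be checked by brute force: run each two-statement loop in scalar order and in vector order and compare the results. The Python sketch below does this under a simplifying assumption, namely that in vector mode each statement sweeps the whole index range before the next statement starts (no 64-element segments). The names run_loop and inhibits_vectorization are hypothetical.

```python
def run_loop(code, vector, n=10):
    """Run the two-statement loop for a three-letter case code such as 'SGI'."""
    order, size, direction = code
    offset = 1 if size == "G" else -1
    a = [float(i) for i in range(n + 2)]        # initial ("old") values of A
    new = [100.0 + i for i in range(n + 2)]     # values stored by the key definition
    other = [None] * (n + 2)                    # array receiving the other reference
    indices = list(range(1, n + 1))
    if direction == "D":                        # decrementing loop index
        indices.reverse()

    def define(i):                              # key definition:  A(I) = ...
        a[i] = new[i]

    def reference(i):                           # other reference: ... = A(I + offset)
        other[i] = a[i + offset]

    # 'S': the other reference is subsequent to the key definition.
    statements = [define, reference] if order == "S" else [reference, define]
    if vector:                                  # each statement sweeps all elements
        for statement in statements:
            for i in indices:
                statement(i)
    else:                                       # statements alternate per iteration
        for i in indices:
            for statement in statements:
                statement(i)
    return a, other


def inhibits_vectorization(code):
    """True when scalar and vector order give different results."""
    return run_loop(code, vector=False) != run_loop(code, vector=True)


cases = ["SLD", "SGI", "PLI", "PGD", "SGD", "SLI", "PLD", "PGI"]
print([c for c in cases if inhibits_vectorization(c)])
# ['SLD', 'SGI', 'PLI', 'PGD']
```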
Rigorous testing for dependency
A more rigorous test takes into account the stride of the indexes of arrays:-
DO 20 J = 2, M, 2
Z(J) = YY(J) + TEMPA
R(J) = Z(J+1) / TEMPB
20 CONTINUE
- Z is both defined and referenced in the loop
- Z(J) is the key definition and Z(J+1) is the other reference
- Take the earlier of the two appearances as ref1 and the later as ref2. Define index1 as the index of ref1, eg the index of Z(J) is J; define index2 as the index of ref2, eg the index of Z(J+1) is J+1; stride is the loop increment, which is 2 here.
- If the sign of index2 minus index1 equals the sign of stride there may be a dependency so proceed to the next step otherwise no dependency exists eg index1 = J, index2 = J+1, stride = 2
- The sign of index2 - index1 = sign of ((J+1)-(J)) = the sign of (1), which is positive
- The sign of stride = the sign of 2, which is positive
- There may be a dependency in the example because the sign of index2 minus index1 equals the sign of stride (both positive) so it is necessary to do the next step.
- If ((index2 minus index1) mod stride) equals 0 there is a dependency, otherwise no dependency exists, for example:
index1 = J, index2 = J+1, stride = 2:
(index2-index1) mod stride =
((J+1)-(J)) mod 2 = 1
- There is no data dependency as the remainder does not equal 0.
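The two-step test can be written as a small function. The Python sketch below follows the steps as described; it takes the constant offsets of index1 and index2 from the loop variable (0 for Z(J), 1 for Z(J+1)) rather than symbolic expressions, which is a simplification, and the names sign and may_depend are hypothetical.

```python
def sign(value):
    """Return -1, 0 or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def may_depend(offset1, offset2, stride):
    """Two-step test on ref1 = X(J + offset1) and ref2 = X(J + offset2)."""
    difference = offset2 - offset1
    # Step 1: signs differ, so no dependency is possible.
    if sign(difference) != sign(stride):
        return False
    # Step 2: a dependency exists only if the stride divides the difference.
    return difference % stride == 0


print(may_depend(0, 1, 2))   # False: Z(J), Z(J+1) with stride 2
print(may_depend(0, 1, 1))   # True:  Z(J), Z(J+1) with stride 1
```

With a stride of 2 the loop only ever defines even-numbered elements of Z (starting from J = 2) and only reads odd-numbered ones, so the two references never touch the same element.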
Vectorizing recurrences
- Prevent the recurrence from affecting the result.
- The threshold of a recurrence is the number of iterations that occur before a value is used. If the vector length does not exceed the threshold, the recurrence does not affect the results in the vectorized version, eg a recurrence whose threshold is 64 is fully vectorized.
- If the compiler can detect a threshold value k in the range 2 < k < 64, the loop is vectorized with a vector length of k, eg a threshold of 6 is vectorized with a vector length of 6.
SUBROUTINE SHORT_VL(A)
DIMENSION A(100)
DO 20 I = 7, 100
A(I) = A(I-6) + 1.0
20 CONTINUE
END
SAFE VECTOR LENGTH
- Compiler can include a run-time test to determine a safe vector length
- safe length is less than or equal to the recurrence threshold
- threshold value k need not be known at compile time; the safe vector length equals k if k<64, otherwise 64
- Use of safe vector length can degrade performance considerably by allowing lengths of 1 and 2
PROGRAM safeval
DIMENSION A(-100:100), B(100), C(100)
K = KFUN(A(I))
N = NFUN(B(I)) ! unknown at compile time
DO I = K, N ! N is vector length
A(I) = A(I) + (I-K)
END DO
END
SUBROUTINE RUNTIME
DIMENSION A(-100:100), B(100), C(100)
COMMON // J
DO I = 1, 100
C(I) = A(I-J)
A(I) = B(I)
END DO
END
- if J >= 64 then vector length = 64
- if J < 1 then vector length = 64
- if 1<= J<64 then vector length = J
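The rules for subroutine RUNTIME can be tried out with a Python model. This is a sketch under stated assumptions: dictionaries keyed by the Fortran subscript stand in for the arrays, vector mode is modelled as processing one segment of vl elements at a time, and the names safe_vector_length, run_scalar and run_vector are hypothetical. Only values of J for which A(I-J) stays inside A(-100:100) are used.

```python
MAX_VL = 64   # vector register capacity quoted in the text


def safe_vector_length(j, max_vl=MAX_VL):
    """Vector length chosen by the run-time test described above."""
    if j < 1 or j >= max_vl:
        return max_vl
    return j


def run_scalar(a, b, j, n=100):
    """Scalar order: C(I) = A(I-J) then A(I) = B(I), for I = 1..n in turn."""
    a, c = dict(a), {}
    for i in range(1, n + 1):
        c[i] = a[i - j]
        a[i] = b[i]
    return c, a


def run_vector(a, b, j, vl, n=100):
    """Vector order, one segment of at most vl elements at a time."""
    a, c = dict(a), {}
    for start in range(1, n + 1, vl):
        segment = range(start, min(start + vl, n + 1))
        loaded = [a[i - j] for i in segment]      # vector load of A(I-J)
        for i, value in zip(segment, loaded):     # vector store into C
            c[i] = value
        for i in segment:                         # vector store of B into A
            a[i] = b[i]
    return c, a


A = {i: float(i) for i in range(-100, 101)}
B = {i: 1000.0 + i for i in range(1, 101)}

for j in (0, 1, 3, 6, 64, 80):
    vl = safe_vector_length(j)
    print(j, vl, run_vector(A, B, j, vl) == run_scalar(A, B, j))
```

Every line printed ends in True. Forcing the full length instead, eg run_vector(A, B, 3, 64), gives results that differ from the scalar run, which is why the shorter safe length is needed when J = 3.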
Data dependency directives
- Data dependency directives can be used to provide the compiler, CF77, with additional information so that code can be fully optimized.
- CFPP$ NODEPCHK directs the compiler to ignore potential data dependencies in a loop but is only safe to use when absolutely sure that no recurrence exists
- CFPP$ NOEQCHK directs the compiler not to examine equivalence statements for recurrences
- CFPP$ RELATION used to provide additional information about array subscript ranges thereby determining whether or not a loop is safe to vectorize
Loops
- IF loops and search loops - both must satisfy the following requirements in order to be vectorized:
- the loop must be executed the correct number of times - no early exits decrease the number of loop iterations from the trip count
- the correct exit must be taken
- all scalar values must be correct on exit
- vectorization - IF expression is evaluated for the full set of loop iterations indicated by the loop's DO statement; the point where the IF expression is satisfied indicates the vector length to be used for the other expressions in the loop. This length is used when those expressions are executed.
- vectorization - IF statements - set vector mask based on conditional result, if no bits in the vector mask are set that block of work is skipped, and elements corresponding to true conditions are gathered, the result computed, and the result scattered back into memory.
- The code may be changed to comply with the requirements for vectorization eg
- Code motion ie moving early exits - so the exit precedes the block's vectorizable statements and the number of iterations are known before the vectorizable statements are executed.
- Values computed following exit - condition for an early exit must not depend on values computed (ie in a previous iteration) in the portion of the loop following the exit.
- Indirect addressing - this inhibits vectorization when the exit condition involves indirect addressing as this may lead to range errors.
- Branches - loops containing branches must comply with certain requirements to be vectorized
- branches into a loop from outside prohibit vectorization
- loops containing backward branches cannot be vectorized since a backward branch is itself a loop
- loops with forward branches permit vectorization
- arithmetic IFs which are obsolete are not vectorizable since they have multiple destinations
- Transformation - compiler's first phase FPP removes vectorization inhibitors:
- converts IF loops to DO loops when they have a single entrance and a single exit
- compiler can analyse any combination of conditional assignments, conditional and unconditional forward branching, and block IFs.
- Loop nest restructuring
- CF77 examines IF and DO loops within a nest of loops for possible optimization
- Examines loops from innermost loops outward until a nontranslatable construct is reached.
DO I=1, N !N>>10
DO J=1,10
A(I,J) = B(I,J) * C(I,J)
END DO
END DO
- Loop nest restructuring
- Restructured or reordered so that the longest vectorizable loop is the innermost loop
DO J=1,10
DO I=1, N
A(I,J) = B(I,J) * C(I,J)
END DO
END DO
Loop optimizations
- Loop collapse - converts a nest of loops into a single loop with a large iteration count
- Loop fusion - combines consecutive loops with no statements between them
DO I=1,N
A(I) = B(I) + C(I)
END DO
DO I=2, N+1
D(I) = E(I) **2
END DO
Combined into
DO I=1,N
A(I) = B(I) + C(I)
D(I+1) = E(I+1) **2
END DO
- Source-level loop unrolling - makes a copy of the loop body for every iteration to be executed eg
DO I =1, N
DO J=1, 4
A(I,J) = B(I,J) * C(I,J)
END DO
END DO
DO I =1, N !Unrolled - vertically
A(I,1) = B(I,1) * C(I,1)
A(I,2) = B(I,2) * C(I,2)
A(I,3) = B(I,3) * C(I,3)
A(I,4) = B(I,4) * C(I,4) !etc
END DO
- Source-level loop unrolling
DO I =1, N
T = 0
DO J=1, 4
T = T + B(I,J)
END DO
A(I) = A(I) + T
END DO
DO I =1, N !Unrolled horizontally
A(I) = A(I) + B(I,1)+B(I,2)+B(I,3)+B(I,4)
END DO
Common optimization techniques
- Loop splitting - split a non-vectorizing loop into separate loops, so that the vectorizable statements are isolated from the non-vectorizable ones
DO I=2, N
A(I) = A(I-1) * D(I)
B(I) = A(I) + C(I-1) **2
END DO
DO I=2, N ! Modified
A(I) = A(I-1) * D(I)
END DO
DO I=2, N
B(I) = A(I) + C(I-1) **2
END DO
- Subroutine in-lining - subroutine and function calls inside a loop prevent vectorization. The loop can be vectorized by bringing the code from the subroutine or function in-line. This can be done manually, using the compiler, or for functions by using a statement function.
- Promoting scalar to vector - scalar recurrences may be avoided using a temporary vector eg
DO J=1, M
S = BB
DO I=1, N
S = S*C
A(I) = A(I) + S
END DO
END DO
- Promoting scalar to vector modified code
DO I=1, M
TV(I) = BB
END DO
DO I=1, N
DO J=1, M
TV(J) = TV(J) * C
A(I) = A(I) + TV(J)
END DO
END DO
- Using the PARAMETER statement - can improve compiler optimization by providing more compile time information about loop lengths and potential data dependencies eg
DO I=2, N
A(I-1) = A(I-S) * B(I) !CONDITIONAL
END DO
Modified
INTEGER S
PARAMETER (S=1)
...
DO I=2, N
A(I-1) = A(I-S) * B(I) !VECTOR
END DO
- IF block ordering - placing the most frequently executed conditions first in block IFs eg
DO I=1, N
IF (A(I) .EQ. 0) THEN
B(I) = AZERO !Rare
ELSEIF (A(I) .LT. 0) THEN
B(I) = ANEG !Rare
ELSE
B(I) = C(I) * D(I)/A(I)
ENDIF
END DO
- IF re-ordering modified code
DO I=1, N
IF (A(I) .GT. 0) THEN
B(I) = C(I) * D(I)/A(I) !Frequent
ELSEIF (A(I) .EQ. 0) THEN
B(I) = AZERO
ELSE
B(I) = ANEG
ENDIF
END DO
Compiler vectorization directives
- VECTOR/NOVECTOR - turns vectorization on or off
- NEXTSCALAR - disables vectorization for the next DO loop found between the directive and the end of the program unit
- IVDEP (SAFEVL = n) - indicates that any dependencies can be ignored if the vector length does not exceed n; only used when it is known that any apparent dependencies will not cause invalid results if a loop is vectorized
- VFUNCTION - declares that a vector version of an external function exists
- SHORTLOOP - allows the compiler to generate faster code when a loop's trip count is 64 or less
- VSEARCH/NOVSEARCH - VSEARCH is the default and enables vectorization of all search loops until a NOVSEARCH is encountered
- RECURRENCE/NORECURRENCE - RECURRENCE enables vectorization of all reduction loops until a NORECURRENCE is encountered or the end of the program unit is reached; NORECURRENCE disables it
All documents are the responsibility of, and copyright, © their authors and do not represent the views of The Parallel Computer Centre, nor of The Queen's University of Belfast.
Maintained by Alan Rea, email A.Rea@qub.ac.uk