Vector functional units
pipelined, fully segmented
each stage of the pipeline performs a step of the function on different operand(s)
once pipeline is full, a new result is produced each clock period (cp).
The loading of a vector register is itself a pipelined operation, with the ability to load one element each clock period after some initial startup overhead.
S(I) = A * X(I) + Y(I)
This example shows how two pipelines can be chained together to form what is effectively a single pipeline containing more segments. The output from the first set of segments is fed directly into the second set of segments, giving an effective pipeline length of 8. Speedup (over scalar code) depends on the number of stages in the pipeline, and chaining increases the number of stages.
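The dependence of speedup on the number of stages can be checked with a small Python model. It is only a sketch: it assumes an unpipelined unit needs one clock per stage for every element, a pipelined unit needs that many clocks to fill and then delivers one result per clock, and the chained pipeline of length 8 is made of two 4-stage units (the text gives only the total). The function names are made up for this illustration.

```python
def scalar_cycles(n_elements, n_stages):
    # Unpipelined: every element occupies the unit for n_stages clocks.
    return n_elements * n_stages


def pipelined_cycles(n_elements, n_stages):
    # Pipelined: n_stages clocks for the first result, then one per clock.
    return n_stages + (n_elements - 1)


def speedup(n_elements, n_stages):
    return scalar_cycles(n_elements, n_stages) / pipelined_cycles(n_elements, n_stages)


# One assumed 4-stage unit versus two chained units (8 stages in total),
# both working on a full 64-element vector.
for stages in (4, 8):
    print(stages, "stages:", round(speedup(64, stages), 2))
```

In this model the speedup approaches the number of stages as the vector gets longer, which is why chaining (more stages) raises the achievable speedup.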
A scalar operation works on only one pair of operands from the S registers and returns the result to another S register, whereas a vector operation can work on 64 pairs of operands to produce 64 results while executing only one instruction. Computational efficiency is achieved by processing each element of a vector identically, eg initialising all the elements of a vector to zero. A vector instruction provides iterative processing of successive vector register elements: it obtains the operands from the first element of one or more V registers and delivers the result to another V register, then continues with the following elements. Successive operand pairs are transmitted to a functional unit in each clock period, so the first result emerges after the start up time of the functional unit and successive results appear each clock period.
Vector overhead is larger than scalar overhead. One reason is that the vector length has to be computed to determine how many vector register loads are going to be needed (ie the number of elements divided by 64, rounded up).
Each vector register can hold up to 64 words, so vectors can only be processed in 64-element segments. This matters when programming: one situation to avoid is where the number of elements to be processed exceeds the register capacity by a small amount, eg a vector length of 65. In this case the first 64 elements are processed from one V register, and the 65th element must then be processed using a separate register after the first 64 elements have been processed. The functional unit processes this element in a time equal to the start up time instead of one clock cycle, reducing the computational efficiency. There is a sharp decrease in performance at each point where the vector length spills over into a new register.
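The drop in efficiency at 65 elements can be illustrated with a rough Python cost model. This is a sketch under assumptions: every 64-element segment pays a fixed start up cost and then one clock per element, and the start up value of 10 clocks is invented for the illustration, not a Cray figure.

```python
VECTOR_REGISTER_LENGTH = 64   # from the text
STARTUP_CYCLES = 10           # assumed value, for illustration only


def segments(n_elements):
    # Number of 64-element segments, rounded up.
    return (n_elements + VECTOR_REGISTER_LENGTH - 1) // VECTOR_REGISTER_LENGTH


def cycles(n_elements):
    # Each segment pays the start up cost once, then one clock per element.
    return segments(n_elements) * STARTUP_CYCLES + n_elements


def cycles_per_element(n_elements):
    return cycles(n_elements) / n_elements


for n in (64, 65, 128, 129):
    print(n, segments(n), cycles(n), round(cycles_per_element(n), 3))
```

In this model the 65th element costs 11 clocks rather than 1, because it drags in a second start up; the cost per element only recovers as the second segment fills.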
On the Cray a vector register can receive a result and retransmit it as an operand to a subsequent operation in the same clock period. In other words, a register may be both a result register and an operand register, which allows two or more vector operations to be chained together as seen earlier. In this way two or more results may be produced per clock cycle.
Parallelism is also possible, as the functional units can operate concurrently and two or more units may be operating at once. This, combined with chaining (using the result of one functional unit as the input of another), leads to very high processing speeds.
DO 10 I = 1, N
JJ(I) = KK(I)+LL(I)
10 CONTINUE
SCALAR PROCESSING
Read one element of Fortran array KK
Read one element of LL
Add the two elements
Write the result to the Fortran array JJ
Increment the loop index by 1
Repeat the above sequence for each succeeding array element until the loop index equals its limit.
VECTOR PROCESSING
Load a series of elements from array KK to a vector register and a series of elements from array LL to another vector register (these operations occur simultaneously except for instruction issue time)
Add the corresponding elements from the two vector registers and send the results to another vector register, representing array JJ
Store the register used for array JJ to memory
This sequence would be repeated if the array had more elements than the maximum elements used in vector processing ie 64.
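The load, add and store sequence, repeated for each group of at most 64 elements, can be sketched in Python. Lists stand in for memory and slices stand in for vector registers; the function name and the use of slices are illustrative assumptions, not a description of the generated Cray code.

```python
MAX_VECTOR_LENGTH = 64  # register capacity given in the text


def vector_add(kk, ll):
    """Compute JJ(I) = KK(I) + LL(I) in segments of at most 64 elements."""
    if len(kk) != len(ll):
        raise ValueError("arrays must have the same length")
    jj = [0] * len(kk)
    for start in range(0, len(kk), MAX_VECTOR_LENGTH):
        stop = min(start + MAX_VECTOR_LENGTH, len(kk))
        v1 = kk[start:stop]                      # load a segment of KK
        v2 = ll[start:stop]                      # load a segment of LL
        v3 = [a + b for a, b in zip(v1, v2)]     # element-wise add
        jj[start:stop] = v3                      # store the segment of JJ
    return jj


print(vector_add(list(range(130)), [1] * 130)[:5])
```

With 130 elements the loop body runs three times: two full segments of 64 and a final segment of 2.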
PROCESSING ORDER AND RESULTS
Inherent to vector processing is a change in the order of operations to be performed on individual array elements, for any loop that includes two separate vectorized operations. The following example illustrates this:
DO 10 I =1, 3
L(I) = J(I) + K(I)
N(I) = L(I) + M(I)
10 CONTINUE
SCALAR VERSION
The two statements within this loop are each executed three times, with the operations alternating;
L(I) is calculated before N(I) in each iteration
the new value of L(I) is used to calculate the value of N(I).
RESULTS OF SCALAR PROCESSING
Event  Operation            Values
1      L(1) = J(1)+K(1)     7 = 2 + 5
2      N(1) = L(1)+M(1)     11 = 7 + 4
3      L(2) = J(2)+K(2)     -1 = (-4) + 3
4      N(2) = L(2)+M(2)     5 = (-1) + 6
5      L(3) = J(3)+K(3)     15 = 7 + 8
6      N(3) = L(3)+M(3)     13 = 15 + (-2)
VECTOR VERSION
With vector processing the first line within the loop processes all elements of the array before the second line is executed.
RESULTS OF VECTOR PROCESSING
Event  Operation            Values
1      L(1) = J(1)+K(1)     7 = 2 + 5
2      L(2) = J(2)+K(2)     -1 = (-4) + 3
3      L(3) = J(3)+K(3)     15 = 7 + 8
4      N(1) = L(1)+M(1)     11 = 7 + 4
5      N(2) = L(2)+M(2)     5 = (-1) + 6
6      N(3) = L(3)+M(3)     13 = 15 + (-2)
NB Both processing methods produce the same results for each array element.
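The two orderings can be replayed in Python using the values from the tables (J = 2, -4, 7; K = 5, 3, 8; M = 4, 6, -2). This is a sketch that imitates only the order of operations; the function names are made up for the illustration.

```python
J = [2, -4, 7]
K = [5, 3, 8]
M = [4, 6, -2]


def scalar_order():
    # Statement 1 and statement 2 alternate, one element at a time.
    L = [0] * 3
    N = [0] * 3
    for i in range(3):
        L[i] = J[i] + K[i]
        N[i] = L[i] + M[i]
    return L, N


def vector_order():
    # Statement 1 is completed for all elements before statement 2 starts.
    L = [j + k for j, k in zip(J, K)]
    N = [l + m for l, m in zip(L, M)]
    return L, N


print(scalar_order())
print(vector_order())
```

Both functions return L = 7, -1, 15 and N = 11, 5, 13, because each N(I) depends only on the L(I) of the same iteration.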
Architecture (machine dependent factors)
relative scalar to vector instruction speed
pipeline length
pipeline startup cost
Software (dependent factors)
vector length ie using optimal vector length
vector code fraction ie increasing this proportion of the code
In a nested structure, only the innermost loops can be vectorized
Vector and scalar versions of code must give equivalent results
Other requirements are based on hardware considerations and limits on the complexity of the loop.
References to external code that cannot be vectorized, eg I/O statements (except in an implied DO loop), references to functions which do not have vector versions, references to external subroutines or functions that are not expanded inline, and any RETURN, STOP, or PAUSE statement, as these generate library calls.
Obsolete conditional statements: arithmetic IF, assigned GOTO and computed GOTO.
Backward branches other than the one which forms the loop.
The presence of source directives NOVECTOR, NEXTSCALAR or SUPPRESS.
A statement branch into the loop from outside the loop.
Array bounds checking.
Dependencies ie constructs producing different results in scalar and vector mode eg recurrences and ambiguous subscript references.
DO I = 1, N
CALL STEP1(A,B,C) !SCALAR
D(I) = A(K) *B(K) + CONST
CALL STEP2(A,B,C,D)
END DO
External function calls (intrinsic functions OK)
DO I = 1, N
A(I) = ADD(B(I), C(I)) !SCALAR
END DO
DO I = 1, N
A(I) = SIN(B(I)) !VECTOR
END DO
Character variables
CHARACTER*80 ABC(100)
...
DO I = 1, 100
ABC(I) = 'Eschew Obscuration' !SCALAR
END DO
Input/Output statements
DO I = 1, N
READ(20) A(I) !SCALAR
END DO
READ(2) (A(I), I=1, 100) !VECTOR registers
Assigned GOTO statements
DO 220 I = 1, N
GOTO IJK (200,210) !SCALAR
200 CONTINUE
X(I) = A(I)
GOTO 220
210 CONTINUE
X(I) = B(I)
220 CONTINUE
Arithmetic IF
DO 290 I =1, N
IF (IJK(I)) 260, 270, 280 !SCALAR
260 CONTINUE
A(I) = 0.
GOTO 290
270 CONTINUE
B(I) = 0.
GOTO 290
280 CONTINUE
C(I) = 0.
290 CONTINUE
Recurrence (data dependency) ie a variable computed in one statement is subsequently used in another statement (or loop iteration)
DO I=2,N
A(I) = A(I-1) * X(I) !SCALAR
END DO
Recurrence ie an expression in one iteration of a loop requires a value that was defined by an expression in a previous iteration of that loop
ALPHA(1) = A(1)
DO 60 I=2, N
ALPHA(I) = A(I) - (B(I)*C(I-1))/ALPHA(I-1)
60 CONTINUE
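Why such a loop must stay scalar can be seen by replaying the first loop above, A(I) = A(I-1) * X(I), in both orders. The Python sketch below models vector mode as "read all operands from the old array, then write all results"; that model and the sample data are assumptions made for the illustration.

```python
def scalar_recurrence(a, x):
    # Each iteration sees the A(I-1) written by the previous iteration.
    a = list(a)
    for i in range(1, len(a)):
        a[i] = a[i - 1] * x[i]
    return a


def naive_vector_recurrence(a, x):
    # All operands are read from the old A before any result is stored.
    old = list(a)
    return [old[0]] + [old[i - 1] * x[i] for i in range(1, len(old))]


a = [1, 1, 1, 1]
x = [0, 2, 3, 4]
print(scalar_recurrence(a, x))        # running product
print(naive_vector_recurrence(a, x))  # uses stale values of A
```

The scalar version builds a running product, while the vector-style version multiplies by stale values of A, so the two disagree and the loop cannot be vectorized as written.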
Branching due to IF or GOTO statements may inhibit or degrade vectorization
DO I = 1, N
IF (X(I) .LE.ALIM) THEN
A(I) = B(I) * 2.0 + C(I)
ELSE
A(I) = D(I) * 5.0 + E(I)
END IF
END DO
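One general way a loop like this can still run in vector mode is to evaluate both branches for every element and then select per element with a mask. The text does not say how the Cray compiler treats this loop, so the Python sketch below shows only the idea; the function names and sample data are invented.

```python
def branching_loop(x, alim, b, c, d, e):
    # Direct transcription of the IF/ELSE loop.
    a = []
    for i in range(len(x)):
        if x[i] <= alim:
            a.append(b[i] * 2.0 + c[i])
        else:
            a.append(d[i] * 5.0 + e[i])
    return a


def masked_loop(x, alim, b, c, d, e):
    # Evaluate the test and both branches over the whole vector,
    # then merge the two result vectors under the mask.
    mask = [xi <= alim for xi in x]
    then_vals = [bi * 2.0 + ci for bi, ci in zip(b, c)]
    else_vals = [di * 5.0 + ei for di, ei in zip(d, e)]
    return [t if m else f for m, t, f in zip(mask, then_vals, else_vals)]


x = [0.5, 2.0, 1.0, 3.0]
b = [1.0, 1.0, 1.0, 1.0]
c = [0.0, 0.0, 0.0, 0.0]
d = [2.0, 2.0, 2.0, 2.0]
e = [1.0, 1.0, 1.0, 1.0]
print(branching_loop(x, 1.0, b, c, d, e))
print(masked_loop(x, 1.0, b, c, d, e))
```

The masked form does the arithmetic for both branches on every element, which is one reason branching can degrade vector performance even when it does not prevent vectorization.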
Consecutive Fortran array elements are stored in distinct memory banks, which are numbered consecutively. Memory addresses for the elements of a two-dimensional array follow column order, ie moving down a column of the array; therefore a single column is spread across different banks, whereas a row can fall entirely in one bank when the column length is a multiple of the number of banks (as is assumed for the column length of 256 below). Memory conflicts arise with successive accesses to the same memory bank, eg accessing elements consecutively along a row as in the following example:
REAL A(256,100), B(256,100)
...
DO I=1, 256
DO J=1, 100 ! Accesses rows - stride 256
A(I,J) = B(I,J) * 2
END DO
END DO
The order of access is A(1,1), A(1,2), A(1,3), ie consecutive elements in a row. A row is stored in one memory bank, so each successive load is held (delayed). How much the vector load and store is degraded depends on the bank-busy time of the system's memory. A vector load and store without memory contention runs 5 to 8 times faster than the same vector load and store with the greatest memory contention.
The target command shows how many memory banks are configured for the system and the length (in clock periods) of a memory hold on the system, as indicated by the heading bankbusy =.
The previous example code can be modified in the following way to produce a stride of 1, with access down the columns:
REAL A(256,100), B(256,100)
...
DO J=1, 100
DO I=1, 256 ! Accesses columns - stride 1
A(I,J) = B(I,J) * 2
END DO
END DO
OR
The following example retains memory accesses along rows, but because each column length is now 257 successive row elements fall in different memory banks, as they would with a stride of one:
REAL A(257,100), B(257,100) ! Column length 257
...
DO I=1, 256
DO J=1, 100
A(I,J) = B(I,J) * 2
END DO
END DO
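The bank pattern behind these three variants can be checked with a short Python model. It is a sketch under assumptions: memory has 64 banks (the real number is system dependent), consecutive words sit in consecutive banks, and the array starts in bank 0. The function names are made up for the illustration.

```python
N_BANKS = 64  # assumed; the real value is system dependent


def banks_along_row(column_length, n_columns, row=1, n_banks=N_BANKS):
    # Column-major storage: A(i,j) sits at word offset (i-1) + (j-1)*column_length.
    return [((row - 1) + (j - 1) * column_length) % n_banks
            for j in range(1, n_columns + 1)]


def banks_down_column(column_length, column=1, n_banks=N_BANKS):
    # Walk down one column: consecutive words, consecutive banks.
    return [((i - 1) + (column - 1) * column_length) % n_banks
            for i in range(1, column_length + 1)]


print(len(set(banks_along_row(256, 100))))   # row access, column length 256
print(len(set(banks_down_column(256))))      # column access
print(len(set(banks_along_row(257, 100))))   # row access, column length 257
```

With a column length of 256 every element of a row maps to the same bank, so each access waits for the previous one. Accessing down a column, or padding the column length to 257, steps through the banks one at a time instead.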
The speedup of the whole program is lower than the speedup of a single loop because of Amdahl's law, which states that the performance of a program is dominated by its slowest component; in the case of a vectorized program this is the scalar code.
Amdahl's Law - the formulation for vector code which is Rv times faster than scalar code is:
sv = 1 / (fs + fv / Rv)
where
sv = maximum expected speedup from vectorization
fv = fraction of a program that is vectorized
fs = fraction of a program that is scalar = 1 - fv
Rv = ratio of scalar to vector processing time
For Cray Research systems Rv ranges from 10 to 20.
It is not always easy to reach 70% to 80% vectorization in a program and vectorizing beyond this level becomes increasingly difficult as it normally requires major changes to the algorithm. Many users stop their vectorization efforts once the vectorized code is running 2 to 4 times faster than scalar code.
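The effect of the vectorized fraction can be worked through numerically. The Python sketch below uses the usual form of Amdahl's law, sv = 1 / (fs + fv / Rv), with the symbols defined above; the function name is made up for the illustration.

```python
def max_speedup(fv, rv):
    """Maximum expected speedup for vectorized fraction fv and ratio rv."""
    fs = 1.0 - fv              # scalar fraction
    return 1.0 / (fs + fv / rv)


for rv in (10, 20):
    for fv in (0.7, 0.8, 0.95):
        print("Rv =", rv, "fv =", fv, "sv =", round(max_speedup(fv, rv), 2))
```

For 70% to 80% vectorization and Rv between 10 and 20, the formula gives speedups between about 2.7 and 4.2, in line with the 2 to 4 times at which many users stop. Even an unlimited Rv cannot push the speedup past 1 / fs.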
Vector length: number of elements in a vector, ie a maximum of 64, so that an array with 640 elements must be divided into 10 vectors
Stride: interval between memory locations for successive elements of a vector. A constant stride is an interval that is the same for all consecutive elements of a vector. Vectorization requires a constant stride
Dependency: anything which causes scalar and vector code to give different results. A recurrence or data dependency is an expression within a loop that requires a value calculated in a previous iteration of the loop in order to be evaluated
Chime: a sequence of vector operations that can be chained with a single vector load and store. The limitation on such a sequence is that the same vector functional unit cannot be used twice in the same chain
Constant increment integer: scalar integer variable whose value changes by a fixed amount on every iteration of a loop
Vector array reference: an array which is referenced within a loop by a constant increment integer
Example:
DO 30 I=1, 1000
A(I) = B(I) * 5.0
30 CONTINUE
B = vector array reference
I = constant increment integer
Indirect-address vector: an array which is referenced by an index array
Invariant array elements: array references which do not change within a loop
Example:
DO 40 I =1, N
A(I) = B(INDX(I)) + C(K)
40 CONTINUE
INDX = indirect-address vector
C(K) = invariant array element
Loop induction variable: constant increment integer used in an array reference
Invariant expression: a constant, variable, or expression which does not change within a loop
Scalar temporary: scalar variable set equal to a vectorizable expression and used later in the loop on the right hand side of an assignment statement
Example:
DO 50 I =1, N
TMP = F(I) - 32.0
C(I) = TFVAP/TCVAP * TMP
50 CONTINUE
I = loop induction variable
TFVAP/TCVAP = invariant expression
TMP = scalar temporary
Vectorizable expression: arithmetic or logical expression that consists of any combination of
invariant expressions
loop induction variables
vector array references
scalar temporaries
intrinsic functions
Vectorizable loop: innermost loop that contains only vectorizable expressions and no vectorization inhibitors