The Queen's University of Belfast
Parallel Computer Centre


2 Vectorization


2.1 Vector Hardware

Vector computers have hardware to perform vector operations efficiently. Operands cannot be used directly from memory; instead they are loaded into registers, and results are returned to registers after the operation. Vector hardware has the special ability to overlap, or pipeline, operand processing.



Vector functional units

pipelined, fully segmented

each stage of the pipeline performs a step of the function on different operand(s)

once pipeline is full, a new result is produced each clock period (cp).

2.1.1 Pipelining

The pipeline is divided into individual segments, each of which is completely independent and involves no hardware sharing, so the machine can be working on several operands at the same time. This enables it to produce one result per clock period once the pipeline is full. The same instruction is obeyed repeatedly using the pipeline technique, so the vector processor processes all the elements of a vector in exactly the same way. The pipeline segments an arithmetic operation, such as a floating point multiply, into stages, passing the output of one stage to the next stage as input. The next pair of operands may enter the pipeline as soon as the first stage has finished with the previous pair, so the processing of a number of operands is carried out simultaneously.

The loading of a vector register is itself a pipelined operation, with the ability to load one element each clock period after some initial startup overhead.
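The benefit of pipelining can be sketched with a back-of-the-envelope timing model. The stage count and the zero start-up here are illustrative assumptions, not Cray-specific figures:

```python
def scalar_time(n, stages):
    """Unpipelined: every operand pair occupies the unit for the full
    `stages` clock periods before the next pair can enter."""
    return n * stages

def pipelined_time(n, stages, startup=0):
    """Pipelined: after `startup + stages` clock periods to fill the pipe,
    one new result emerges every clock period."""
    if n == 0:
        return 0
    return startup + stages + (n - 1)

# A hypothetical 6-stage floating point multiply over 64 operand pairs:
print(scalar_time(64, 6))     # 384 clock periods
print(pipelined_time(64, 6))  # 69 clock periods
```

The longer the vector, the closer the pipelined rate gets to one result per clock period.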

2.1.2 Chaining

Theoretical speedup depends on the number of segments in the pipeline, so there is a direct relationship between the number of stages you can keep full and the performance of the code. The effective size of the pipeline can be increased by chaining: the Cray combines more than one pipeline, so that the result from one pipeline can be used directly as an operand in a second pipeline, as illustrated in the next diagram.



S(I) = A * X(I) + Y(I)



This example shows how two pipelines can be chained together to form, in effect, a single pipeline containing more segments. The output of the first set of segments is fed directly into the second set, giving a resultant effective pipeline length of 8. Speedup (over scalar code) depends on the number of stages in the pipeline, and chaining increases that number.

2.2 Comparison - vector and scalar operation

A comparison of scalar and vector operation is shown in the following diagram, which omits the overhead incurred in starting a pipeline.

A scalar operation works on only one pair of operands from S registers and returns the result to another S register, whereas a vector operation can work on 64 pairs of operands together, producing 64 results while executing only one instruction. Computational efficiency is achieved by processing each element of a vector identically, eg initialising all the elements of a vector to zero. A vector instruction provides iterative processing of successive vector register elements: the operands are obtained from the first elements of one or more V registers, and the result is delivered to another V register. Successive operand pairs are transmitted to a functional unit in each clock period, so the first result emerges after the start-up time of the functional unit and successive results appear each clock cycle.

Vector overhead is larger than scalar overhead, one reason being that the vector length must be computed to determine how many vector registers are going to be needed (ie the number of elements divided by 64).

Each vector register can hold up to 64 words, so vectors can only be processed in 64-element segments. This matters when programming: one situation to avoid is where the number of elements to be processed exceeds the register capacity by a small amount, eg a vector length of 65. In this case the first 64 elements are processed from one V register, and the 65th element must then be processed using a separate register. The functional unit processes this last element in a time equal to the start-up time rather than one clock cycle, reducing the computational efficiency. There is a sharp decrease in performance at each point where the vector length spills over into a new register.

The Cray can receive a result into a vector register and retransmit it as an operand to a subsequent operation in the same clock period. In other words, a register may be both a result and an operand register, which allows the chaining of two or more vector operations together, as seen earlier. In this way two or more results may be produced per clock cycle.

Parallelism is also possible as the functional units can operate concurrently and two or more units may be co-operating at once. This combined with chaining, using the result of one functional unit as the input of another, leads to very high processing speeds.

2.2.1 Scalar and vector processing examples

DO 10 I = 1, 3

JJ(I) = KK(I)+LL(I)

10 CONTINUE

SCALAR PROCESSING

Read one element of Fortran array KK

Read one element of LL

Add the results

Write the results to the Fortran array JJ

Increment the loop index by 1

Repeat the above sequence for each succeeding array element until the loop index equals its limit.

VECTOR PROCESSING

Load a series of elements from array KK to a vector register and a series of elements from array LL to another vector register (these operations occur simultaneously except for instruction issue time)

Add the corresponding elements from the two vector registers and send the results to another vector register, representing array JJ

Store the register used for array JJ to memory

This sequence would be repeated if the array had more elements than the maximum elements used in vector processing ie 64.

PROCESSING ORDER AND RESULTS

Inherent to vector processing is a change in the order of operations performed on individual array elements in any loop that includes two separate vectorized operations. The following example illustrates this:

DO 10 I =1, 3

L(I) = J(I) + K(I)

N(I) = L(I) + M(I)

10 CONTINUE

SCALAR VERSION

The two statements within this loop are each executed three times, with the operations alternating:

L(I) is calculated before N(I) in each iteration

the new value of L(I) is used to calculate the value of N(I).

RESULTS OF SCALAR PROCESSING

Event Operation Values

1 L(1) = J(1)+K(1) 7 = 2 + 5

2 N(1) = L(1)+M(1) 11 = 7 + 4

3 L(2) = J(2)+K(2) -1 = (-4) + 3

4 N(2) = L(2)+M(2) 5 = (-1) + 6

5 L(3) = J(3)+K(3) 15 = 7 + 8

6 N(3) = L(3)+M(3) 13 = 15 + (-2)

VECTOR VERSION

With vector processing the first line within the loop processes all elements of the array before the second line is executed.

RESULTS OF VECTOR PROCESSING

Event Operation Values

1 L(1) = J(1)+K(1) 7 = 2 + 5

2 L(2) = J(2)+K(2) -1 = (-4) + 3

3 L(3) = J(3)+K(3) 15 = 7 + 8

4 N(1) = L(1)+M(1) 11 = 7 + 4

5 N(2) = L(2)+M(2) 5 = (-1) + 6

6 N(3) = L(3)+M(3) 13 = 15 + (-2)

NB Both processing methods produce the same results for each array element.

2.3 Vector Performance

This is governed by a number of factors eg


Architecture (machine dependent factors)

relative scalar to vector instruction speed

pipeline length

pipeline startup cost

Software (program dependent factors)

vector length ie using optimal vector length

vector code fraction ie increasing this proportion of the code

2.4 General Requirements for vectorization

DO loops, DO WHILE loops, and loops formed with IF and GOTO statements can be vectorized if they fulfil the following criteria:-

In a nested structure, only the innermost loops can be vectorized

Vector and scalar versions of code must give equivalent results

Other requirements are based on hardware considerations and limits on the complexity of the loop.

2.5 Conditions that inhibit vectorization

Vector code cannot result if any of the following are in an innermost loop:-

references to external code that cannot be vectorized, eg I/O statements (except in an implied DO loop), references to functions which do not have vector versions, references to external subroutines or functions that are not expanded inline, and any RETURN, STOP, or PAUSE statement, as these generate library calls.

Obsolete conditional statements: arithmetic IF, assigned GOTO and computed GOTO.

Backward branches other than the one which forms the loop.

The presence of source directives NOVECTOR, NEXTSCALAR or SUPPRESS.

A branch into the loop from outside the loop.

Array bounds checking.

Dependencies ie constructs producing different results in scalar and vector mode eg recurrences and ambiguous subscript references.

2.5.1 Examples of vectorization inhibitors

Subroutine calls

DO I = 1, N

CALL STEP1(A,B,C) !SCALAR

D(I) = A(K) *B(K) + CONST

CALL STEP2(A,B,C,D)

END DO

External function calls (intrinsic functions OK)

DO I = 1, N

A(I) = ADD(B(I), C(I)) !SCALAR

END DO

DO I = 1, N

A(I) = SIN(B(I)) !VECTOR

END DO

Character variables

CHARACTER*80 ABC(100)

...

DO I = 1, 100

ABC(I) = 'Eschew Obscuration' !SCALAR

END DO

Input/Output statements

DO I = 1, N

READ(20) A(I) !SCALAR

END DO

READ(2) (A(I), I=1, 100) !VECTOR registers

Assigned GOTO statements

DO 220 I = 1, N

GOTO IJK (200,210) !SCALAR

200 CONTINUE

X(I) = A(I)

GOTO 220

210 CONTINUE

X(I) = B(I)

220 CONTINUE

Arithmetic IF

DO 290 I =1, N

IF (IJK(I)) 260, 270, 280 !SCALAR

260 CONTINUE

A(I) = 0.

GOTO 290

270 CONTINUE

B(I) = 0.

GOTO 290

280 CONTINUE

C(I) = 0.

290 CONTINUE

Recursion (data dependency), ie a value computed in one statement is subsequently used in another statement or loop iteration

DO I=2,N

A(I) = A(I-1) * X(I) !SCALAR

END DO

Recursion ie an expression in one iteration of a loop requires a value that was defined by an expression in a previous iteration of that loop

ALPHA(1) = A(1)

DO 60 I=2, N

ALPHA(I) = A(I) - (B(I)*C(I-1))/ALPHA(I-1)

60 CONTINUE

Branching due to IF or GOTO statements may inhibit or degrade vectorization

DO I = 1, N

IF (X(I) .LE. ALIM) THEN

A(I) = B(I) * 2.0 + C(I)

ELSE

A(I) = D(I) * 5.0 + E(I)

END IF

END DO

2.6 Memory contention

The performance improvement expected with vectorization can be degraded by memory contention ie the conflict between successive accesses to the same memory bank. Each access to a location causes that bank to be unavailable for some number of clock cycles and any attempts to access a bank that is currently unavailable are held until that bank is free.

Consecutive Fortran array elements are stored in distinct memory banks which are numbered consecutively. Memory addresses for elements in a two-dimensional array are in the order corresponding to elements moving down a column of an array; therefore, a single column is stored in different banks, and a row is stored all in one bank. Memory conflicts can arise with successive accesses to the same memory bank eg accessing elements consecutively in a row as in the following example:

REAL A(256,100), B(256,100)

...

DO I=1, 256

DO J=1, 100 ! Accesses rows - stride 256

A(I,J) = B(I,J) * 2

END DO

END DO

The order of access is A(1,1), A(1,2), A(1,3), ie consecutive elements in a row. A row is stored in one memory bank, so each successive load is held (delayed). The performance of the vector load and store is degraded depending on the bank-busy time of your system's memory. A vector load and store without memory contention runs 5 to 8 times faster than the same vector load and store with the greatest memory contention.

The target command shows how many memory banks are configured for the system, and the length (in clock periods) of a memory hold on the system, indicated by the heading bankbusy =.

2.7 Memory optimization

Efficient memory access can be ensured by writing loops so that elements are accessed down columns (incrementing the first subscript on consecutive elements) and by ensuring that vectors to be loaded or stored have odd strides. An odd stride guarantees that successive accesses are not made to the same memory bank.

The previous example code can be modified in the following way to produce a stride of 1 with access down the columns:-

REAL A(256,100), B(256,100)

...

DO J=1, 100

DO I=1, 256 ! Accesses columns - stride 1

A(I,J) = B(I,J) * 2

END DO

END DO

OR

The following example retains memory accesses along rows but with a stride of one because each column length is now 257:

REAL A(257,100), B(257,100) ! Column length 257

...

DO I=1, 256

DO J=1, 100

A(I,J) = B(I,J) * 2

END DO

END DO

2.8 Amdahl's Law for vectorization

It is common for programs to be 70% to 80% vectorized ie 70% to 80% of their running time is spent executing vector instructions.

The speedup of the whole program is lower than the speedup of a single loop because of Amdahl's law, which states that the performance of a program is dominated by its slowest component - in the case of a vectorized program, the scalar code.

Amdahl's Law - for vector code which is Rv times faster than scalar code, the maximum expected speedup is:

sv = 1 / (fs + fv/Rv)

where

sv = maximum expected speedup from vectorization

fv = fraction of a program that is vectorized

fs = fraction of a program that is scalar = 1 - fv

Rv = ratio of scalar to vector processing time

For Cray Research systems Rv ranges from 10 to 20.

It is not always easy to reach 70% to 80% vectorization in a program and vectorizing beyond this level becomes increasingly difficult as it normally requires major changes to the algorithm. Many users stop their vectorization efforts once the vectorized code is running 2 to 4 times faster than scalar code.

2.9 Terminology

Vector - a series of values on which instructions operate, eg an array or array subsets (such as columns, rows, or diagonals in which the intervals between the array elements' locations are constant)

Vector length - the number of elements in a vector, ie a maximum of 64, so that an array with 640 elements must be divided into 10 vectors

Stride - the interval between memory locations for successive elements of a vector. A constant stride is an interval that is the same for all consecutive elements of a vector; vectorization requires a constant stride.

Dependency - anything which causes scalar and vector code to give different results. A recurrence or data dependency is an expression within a loop that requires a value calculated in a previous iteration of the loop in order to be evaluated.

Chime - a sequence of vector operations that can be chained with a single vector load and store. The limitation on such a sequence is that the same vector functional unit cannot be used twice in the same chain.

Constant increment integer - a scalar integer variable whose value changes by a fixed amount on every iteration of a loop

Vector array reference - an array which is referenced within a loop by a constant increment integer

Example DO 30 I=1, 1000

A(I) = B(I) * 5.0

30 CONTINUE

B = vector array reference

I = constant increment integer

Indirect-address vector - an array which is referenced by an index array

Invariant array element - an array reference which does not change within a loop

Example DO 40 I =1, N

A(I) = B(INDX(I)) + C(K)

40 CONTINUE

INDX = indirect-address vector

K = invariant array element

Loop induction variable - a constant increment integer used in an array reference

Invariant expression - a constant, variable, or expression which does not change within a loop

Scalar temporary - a scalar variable set equal to a vectorizable expression and used later in the loop on the right-hand side of an assignment statement

Example DO 50 I =1, N

TMP = F(I) - 32.0

C(I) = TFVAP/TCVAP * TMP

50 CONTINUE

I = loop induction variable

TFVAP/TCVAP = invariant expression

TMP = scalar temporary

Vectorizable expression - an arithmetic or logical expression that consists of any combination of:

invariant expressions

loop induction variables

vector array references

scalar temporaries

intrinsic functions

Vectorizable loop - an innermost loop which contains only vectorizable expressions and no vectorization inhibitors


All documents are the responsibility of, and copyright, © their authors and do not represent the views of The Parallel Computer Centre, nor of The Queen's University of Belfast.
Maintained by Alan Rea, email A.Rea@qub.ac.uk
Generated with CERN WebMaker