The Queen's University of Belfast
Parallel Computer Centre

VECTOR PROCESSING






Vector Multiply

DO 10 I=1,100

A(I) = B(I) * C(I)

10 CONTINUE

Pseudo Code

VLOAD B VR1

VLOAD C VR2

VMULT VR1 VR2 VR3

VSTORE VR3 A
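The pseudo code can be mimicked in Python as a minimal sketch. The names vload, vmult and vstore and the list-based "registers" are illustrative assumptions, not a real instruction set. The result is stored to A, as the Fortran loop A(I) = B(I) * C(I) requires.

```python
def vload(memory, name):
    """Copy a whole array from memory into a (hypothetical) vector register."""
    return list(memory[name])


def vmult(vr1, vr2):
    """Multiply two vector registers element by element."""
    return [x * y for x, y in zip(vr1, vr2)]


def vstore(memory, name, vr):
    """Write a vector register back to the named array in memory."""
    memory[name] = list(vr)


def vector_multiply(b, c):
    """Model of A(I) = B(I) * C(I) as two loads, one multiply and one store."""
    memory = {"B": b, "C": c, "A": [0.0] * len(b)}
    vr1 = vload(memory, "B")
    vr2 = vload(memory, "C")
    vr3 = vmult(vr1, vr2)
    vstore(memory, "A", vr3)
    return memory["A"]
```

Each call works on a whole array at once, which is the point of the vector instructions: one instruction replaces a full trip through the scalar loop.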

Chaining

The results of one vector operation may be fed directly into another. Multiple vector operations can therefore be chained into one long operation, which changes the ORDER of processing.

Vector Load : 17 clock cycles

Vector Multiply : 4 clock cycles

Vector Store : 17 clock cycles

Total : 38 clock cycles

Vector Hardware

Vector functional units



Vector registers and pipeline

Pipelining and Chaining

S(I) = A * X(I) + Y(I)



Comparison - vector and scalar operation

Scalar and Vector Processing Example

DO 10 I = 1, 256

JJ(I) = KK(I)+LL(I)

10 CONTINUE

Both methods of processing must produce the same results for each array element.

Scalar Processing

Vector Processing

Scalar and Vector Processing Example

DO 10 I = 1, 3

L(I) = J(I)+K(I)

N(I) = L(I)+M(I)

10 CONTINUE

Both methods of processing must produce the same results for each array element.

Scalar Processing Order and Results

L(I) is calculated before N(I) in each iteration. L(I) is then used to calculate the value of N(I).

Results of Scalar processing

Event Operation Values

1 L(1) = J(1)+K(1) 7 = 2 + 5

2 N(1) = L(1)+M(1) 11 = 7 + 4

3 L(2) = J(2)+K(2) -1 = (-4) + 3

4 N(2) = L(2)+M(2) 5 = (-1) + 6

5 L(3) = J(3)+K(3) 15 = 7 + 8

6 N(3) = L(3)+M(3) 13 = 15 + (-2)

Vector Processing Order and Results

Results of Vector processing

Event Operation Values

1 L(1) = J(1)+K(1) 7 = 2 + 5

2 L(2) = J(2)+K(2) -1 = (-4) + 3

3 L(3) = J(3)+K(3) 15 = 7 + 8

4 N(1) = L(1)+M(1) 11 = 7 + 4

5 N(2) = L(2)+M(2) 5 = (-1) + 6

6 N(3) = L(3)+M(3) 13 = 15 + (-2)
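The two event tables can be reproduced with a short Python sketch. The input values for J, K and M are taken from the tables above; the function names and the event labels are illustrative assumptions.

```python
# Input values read off the "Values" column of the tables.
J = {1: 2, 2: -4, 3: 7}
K = {1: 5, 2: 3, 3: 8}
M = {1: 4, 2: 6, 3: -2}


def scalar_order():
    """Run both statements for I=1, then both for I=2, and so on."""
    L, N, events = {}, {}, []
    for i in (1, 2, 3):
        L[i] = J[i] + K[i]
        events.append(f"L({i})")
        N[i] = L[i] + M[i]
        events.append(f"N({i})")
    return L, N, events


def vector_order():
    """Run the first statement for every I, then the second for every I."""
    L, N, events = {}, {}, []
    for i in (1, 2, 3):
        L[i] = J[i] + K[i]
        events.append(f"L({i})")
    for i in (1, 2, 3):
        N[i] = L[i] + M[i]
        events.append(f"N({i})")
    return L, N, events
```

The order of events differs between the two functions, but the final contents of L and N are identical, so this loop is safe to vectorize.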

Vector Performance

This is governed by a number of factors eg

Terminology

Terminology Examples

DO 30 I=1, 1000

A(I) = B(I) * 5.0

30 CONTINUE

B = vector array reference

I = constant increment integer

DO 40 I =1, N

A(I) = B(INDX(I)) + C(K)

40 CONTINUE

INDX = indirect-address vector

C(K) = invariant array element

DO 50 I =1, N

TMP = F(I) - 32.0

C(I) = TFVAP/TCVAP * TMP

50 CONTINUE

I = loop induction variable

TFVAP/TCVAP = invariant expression

TMP = scalar temporary

General Requirements for vectorization

DO loops, DO WHILE loops and IF loops can be vectorized if they fulfil the following criteria:-

Conditions inhibiting vectorization

In an innermost loop inhibitors are:

Vectorization inhibitors

DO I = 1, N

CALL STEP1(A,B,C) !SCALAR

D(I) = A(K) *B(K) + CONST

END DO

DO I = 1, N

A(I) = ADD(B(I), C(I)) !SCALAR

END DO

DO I = 1, N

A(I) = SIN(B(I)) !VECTOR

END DO

CHARACTER*80 ABC(100)

...

DO I = 1, 100

ABC(I) = 'Eschew Obscuration' !scalar

END DO

DO I = 1, N

READ(20) A(I) !SCALAR

END DO

READ(20) (A(I), I=1, N) !vector version

DO 220 I = 1, N

GOTO IJK (200,210) !SCALAR

200 CONTINUE

X(I) = A(I)

GOTO 220

210 CONTINUE

X(I) = B(I)

220 CONTINUE

DO 290 I =1, N

IF (IJK(I)) 260, 270, 280 !SCALAR

260 CONTINUE

A(I) = 0.

GOTO 290

270 CONTINUE

B(I) = 0.

GOTO 290

280 CONTINUE

C(I) = 0.

290 CONTINUE

DO I=2,N

A(I) = A(I-1) * X(I) !scalar

END DO

ALPHA(1) = A(1)

DO 60 I=2, N

ALPHA(I) = A(I) - (B(I)*C(I-1))/ALPHA(I-1)

60 CONTINUE
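Why a recurrence such as A(I) = A(I-1) * X(I) runs in scalar mode can be seen from a small Python sketch. It models vector execution in a simplified way, as an assumption: every operand is loaded before any result is stored. The function names are hypothetical.

```python
def scalar_recurrence(a, x):
    """Each iteration uses the value of a[i-1] computed by the previous one."""
    a = list(a)
    for i in range(1, len(a)):
        a[i] = a[i - 1] * x[i]
    return a


def naive_vector_recurrence(a, x):
    """Load all a[i-1] operands first (old values), then multiply and store."""
    operands = a[:-1]  # vector load happens before any store
    return [a[0]] + [p * xi for p, xi in zip(operands, x[1:])]
```

With a = [1, 1, 1, 1] and x = [0, 2, 3, 4] (x[0] is unused), the scalar loop gives [1, 2, 6, 24], while the naive vector version gives [1, 2, 3, 4] because it never sees the updated elements.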

DO I = 1, N

IF (X(I) .LE.ALIM) THEN

A(I) = B(I) * 2.0 + C(I)

ELSE

A(I) = D(I) * 5.0 + E(I)

END IF

END DO

Memory contention and optimization

Vectorization performance can be degraded by memory contention ie conflict between successive accesses to the same memory bank.

Accessing elements consecutively in a row eg

REAL A(256,100), B(256,100)

...

DO I=1, 256

DO J=1, 100 ! Accesses rows - stride 256

A(I,J) = B(I,J) * 2

END DO

END DO

Memory contention

Example

REAL A(256,100), B(256,100)

DO J=1, 100

DO I=1, 256 ! Accesses columns - stride 1

A(I,J) = B(I,J) * 2

END DO

END DO

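The following example retains memory accesses along rows, but the stride becomes 257 because each column length is now 257. With a power-of-two number of memory banks, an odd stride shares no common factor with the bank count, so successive accesses fall in different banks: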

REAL A(257,100),B(257,100)

* Column length 257

...

DO I=1, 256

DO J=1, 100

A(I,J) = B(I,J) * 2

END DO

END DO
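The effect of the column length on bank usage can be estimated with a short Python sketch. It assumes 64 word-interleaved memory banks (bank = word offset mod 64); the bank count and the function name are assumptions for illustration and are not given in the text.

```python
def banks_touched(leading_dim, row=1, ncols=100, num_banks=64):
    """Count the distinct banks hit when walking along one row of a
    column-major array REAL A(leading_dim, ncols)."""
    # Column-major word offset of A(row, j), counting from zero.
    offsets = [(row - 1) + (j - 1) * leading_dim for j in range(1, ncols + 1)]
    return len({off % num_banks for off in offsets})
```

Under these assumptions banks_touched(256) is 1, so every access along the row hits the same bank, whereas banks_touched(257) is 64, so the accesses are spread over all banks.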

STRIDE EXAMPLES

DO 300 J = 1, 3

DO 300 I = 1, 4

MATRIX(I,J) = MATRIX(I,J)+1.0

300 CONTINUE

DO 310 I=1, 4

DO 310 J =1, 3

MATRIX(I,J) = MATRIX(I,J)+1.0

310 CONTINUE

DO 320 I=1, 4

MATRIX(I,I) = MATRIX(I,I)+1.0

320 CONTINUE
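The strides of the three loops can be checked with a small Python sketch of Fortran's column-major layout. The declaration of MATRIX is not shown in the text, so a column length (leading dimension) of 4 is assumed here; the function names are hypothetical.

```python
def offset(i, j, leading_dim=4):
    """Zero-based word offset of MATRIX(i, j) in column-major storage."""
    return (i - 1) + (j - 1) * leading_dim


def strides(index_pairs, leading_dim=4):
    """Distance in words between successive accesses."""
    offs = [offset(i, j, leading_dim) for i, j in index_pairs]
    return [b - a for a, b in zip(offs, offs[1:])]


# Access order of loop 300: I varies fastest (down the columns).
LOOP_300 = [(i, j) for j in range(1, 4) for i in range(1, 5)]
# Inner loop of loop 310 for I = 1: J varies (along a row).
LOOP_310_ROW = [(1, j) for j in range(1, 4)]
# Loop 320: the diagonal.
LOOP_320 = [(i, i) for i in range(1, 5)]
```

Under the assumed column length, loop 300 has stride 1, the inner loop of 310 has stride 4, and the diagonal loop 320 has stride 5 (the column length plus one).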

Layout of a 2D Array in Memory



Amdahl's Law for vectorization

Amdahl's Law - the formulation for vector code which is Rv times faster than scalar code is:

sv = 1 / (fs + fv/Rv)

where:

sv = maximum expected speedup

fv = fraction of a program that is vectorized

fs = fraction of a program that is scalar = 1 - fv

Rv = ratio of scalar to vector processing time

For Cray Research systems Rv ranges from 10 to 20.
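A minimal Python sketch of the calculation, assuming the usual form of Amdahl's Law, sv = 1 / (fs + fv/Rv), with the symbols defined above. The function name is hypothetical.

```python
def vector_speedup(fv, rv):
    """Maximum expected speedup for a vectorized fraction fv when vector
    code runs rv times faster than scalar code."""
    if not 0.0 <= fv <= 1.0:
        raise ValueError("fv must lie between 0 and 1")
    if rv <= 0:
        raise ValueError("rv must be positive")
    fs = 1.0 - fv  # scalar fraction
    return 1.0 / (fs + fv / rv)
```

For Rv = 10, a program that is 50% vectorized speeds up by only about 1.8, and one that is 90% vectorized by about 5.3, so the scalar fraction quickly dominates.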

Dependencies

DO I =2, 4

IB(I) = IA(I-1)

IA(I) = IC(I)

ENDDO

Scalar Processing Order and Results

Results of Scalar processing

Event Operation Values

1 IB(2) = IA(2-1) 3

2 IA(2) = IC(2) 4

3 IB(3) = IA(3-1) 4

4 IA(3) = IC(3) 6

5 IB(4) = IA(4-1) 6

6 IA(4) = IC(4) 8

Vector Processing Order and Results

Results of Vector processing

Event Operation Values

1 IB(2) = IA(2-1) 3

2 IB(3) = IA(3-1) 6

3 IB(4) = IA(4-1) 9

4 IA(2) = IC(2) 4

5 IA(3) = IC(3) 6

6 IA(4) = IC(4) 8
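The two tables can be reproduced with a Python sketch. The starting values IA(1..3) = 3, 6, 9 and IC(2..4) = 4, 6, 8 are read off the tables; IA(4) and IC(1) are never read by the loop, so the values used for them here are placeholders. The function names are hypothetical.

```python
IA0 = {1: 3, 2: 6, 3: 9, 4: 12}  # IA(4) = 12 is a placeholder
IC = {1: 0, 2: 4, 3: 6, 4: 8}    # IC(1) = 0 is a placeholder


def scalar_order():
    """Both statements run for I=2, then I=3, then I=4."""
    ia, ib = dict(IA0), {}
    for i in (2, 3, 4):
        ib[i] = ia[i - 1]
        ia[i] = IC[i]
    return ib, ia


def vector_order():
    """All IB assignments run first, then all IA assignments."""
    ia, ib = dict(IA0), {}
    for i in (2, 3, 4):
        ib[i] = ia[i - 1]
    for i in (2, 3, 4):
        ia[i] = IC[i]
    return ib, ia
```

In scalar order IB(3) and IB(4) pick up the values just stored into IA by the previous iteration (4 and 6). In vector order they pick up the old values (6 and 9), so the results differ and the loop has a dependency.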

Testing for dependency

Example

DO 20 J = 2, M

Z(J) = YY(J) + TEMPA

R(J) = Z(J + 1)/TEMPB

20 CONTINUE

Array Z is both defined and referenced within the loop: the appearance on the left-hand side, Z(J), is the key definition, and the appearance on the right-hand side, Z(J + 1), is the other reference.

Test the loop using the key definition and the other reference as follows:

Examples which inhibit vectorization

SUBROUTINE SGI(A, B, C)

DO 10 I = 1, 99

A(I) = B(I)

C(I) = A(I+1)

10 CONTINUE

END

SUBROUTINE SLD(A, B, C)

DO 10 I = 100, 2, -1

A(I) = B(I)

C(I) = A(I-1)

10 CONTINUE

END

SUBROUTINE PLI(A, B, C)

DO 10 I = 2, 100

B(I) = A(I-1)

A(I) = C(I)

10 CONTINUE

END

SUBROUTINE PGD(A, B, C)

DO 10 I = 99, 1, -1

B(I) = A(I+1)

A(I) = C(I)

10 CONTINUE

END

Examples which do not inhibit vectorization

SUBROUTINE SGD(A, B, C)

DO 10 I = 99, 1, -1

A(I) = B(I)

C(I) = A(I+1)

10 CONTINUE

END

SUBROUTINE SLI(A, B, C)

DO 10 I = 2, 100

A(I) = B(I)

C(I) = A(I-1)

10 CONTINUE

END

SUBROUTINE PLD(A, B, C)

DO 10 I = 100, 2, -1

B(I) = A(I-1)

A(I) = C(I)

10 CONTINUE

END

SUBROUTINE PGI(A, B, C)

DO 10 I = 1, 99

B(I) = A(I+1)

A(I) = C(I)

10 CONTINUE

END
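The eight subroutines can be checked by simulation. The Python sketch below runs each loop in scalar order (both statements per iteration) and in a simplified vector order (the first statement for every I, then the second), and reports a dependency when the results differ. The table encoding and the function names are assumptions made for illustration.

```python
# Each case: first index, last index, step, statement 1, statement 2.
# A statement (target, source, offset) means target(I) = source(I + offset).
CASES = {
    "SGI": (1, 99, 1, ("a", "b", 0), ("c", "a", 1)),
    "SLD": (100, 2, -1, ("a", "b", 0), ("c", "a", -1)),
    "PLI": (2, 100, 1, ("b", "a", -1), ("a", "c", 0)),
    "PGD": (99, 1, -1, ("b", "a", 1), ("a", "c", 0)),
    "SGD": (99, 1, -1, ("a", "b", 0), ("c", "a", 1)),
    "SLI": (2, 100, 1, ("a", "b", 0), ("c", "a", -1)),
    "PLD": (100, 2, -1, ("b", "a", -1), ("a", "c", 0)),
    "PGI": (1, 99, 1, ("b", "a", 1), ("a", "c", 0)),
}


def execute(case, vector):
    """Run one case and return the final arrays (index 0 is unused)."""
    first, last, step, s1, s2 = CASES[case]
    arrays = {
        "a": [float(i) for i in range(101)],
        "b": [1000.0 + i for i in range(101)],
        "c": [2000.0 + i for i in range(101)],
    }
    indices = list(range(first, last + step, step))

    def apply(stmt, i):
        target, source, off = stmt
        arrays[target][i] = arrays[source][i + off]

    if vector:
        for i in indices:
            apply(s1, i)
        for i in indices:
            apply(s2, i)
    else:
        for i in indices:
            apply(s1, i)
            apply(s2, i)
    return arrays


def inhibits(case):
    """True when vector order changes the results of the loop."""
    return execute(case, vector=True) != execute(case, vector=False)
```

Under this model SGI, SLD, PLI and PGD give different results in the two orders, while SGD, SLI, PLD and PGI give the same results, matching the grouping above.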

Rigorous testing for dependency

A more rigorous test takes into account the stride of the indexes of arrays:-

DO 20 J = 2, M, 2

Z(J) = YY(J) + TEMPA

R(J) = Z(J+1) / TEMPB

20 CONTINUE

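index1 = J, index2 = J+1, stride = 2:

(index2-index1) mod stride =

((J+1)-(J)) mod 2 = 1

The result is non-zero, so Z(J+1) never refers to an element that the loop defines as Z(J), and there is no dependency.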
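The stride test can be written as a small Python sketch and compared with a brute-force check of which elements are defined and referenced. The function names are hypothetical, and the modular test is treated only as a necessary condition, since loop bounds may rule out an overlap as well.

```python
def may_depend(def_offset, ref_offset, stride):
    """A definition Z(J + def_offset) and a reference Z(J + ref_offset) can
    touch the same element only if the offsets differ by a multiple of the
    loop stride."""
    if stride == 0:
        raise ValueError("stride must be non-zero")
    return (ref_offset - def_offset) % abs(stride) == 0


def overlaps(def_offset, ref_offset, start, stop, stride):
    """Brute force: do the defined and referenced index sets intersect?"""
    js = range(start, stop + 1, stride)
    defined = {j + def_offset for j in js}
    referenced = {j + ref_offset for j in js}
    return bool(defined & referenced)
```

For the loop above, may_depend(0, 1, 2) is False: the loop defines only even elements of Z and references only odd ones. With a stride of 1 the same offsets give True.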

Vectorizing recurrences

SUBROUTINE SHORT_VL(A)

DIMENSION A(100)

DO 20 I = 7, 100

A(I) = A(I-6) + 1.0

20 CONTINUE

END
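In SHORT_VL each element depends on the element six places earlier, so the loop behaves correctly if it is processed in vector chunks of at most six elements. The Python sketch below models chunked vector execution in a simplified way, as an assumption: each chunk loads all its operands before storing any result. The function names are hypothetical.

```python
def scalar_loop(a):
    """Reference result; a[0] is unused so that indices match Fortran."""
    a = list(a)
    for i in range(7, 101):
        a[i] = a[i - 6] + 1.0
    return a


def chunked_vector_loop(a, vl):
    """Process the loop in chunks of vl elements (the vector length)."""
    a = list(a)
    for start in range(7, 101, vl):
        idx = range(start, min(start + vl, 101))
        loaded = [a[i - 6] for i in idx]  # vector load of the whole chunk
        for i, value in zip(idx, loaded):  # vector add and store
            a[i] = value + 1.0
    return a
```

Starting from an array of zeros, a vector length of 6 or less reproduces the scalar result, while a length of 7 or 64 does not, because some loads then happen before the stores they depend on.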

SAFE VECTOR LENGTH

PROGRAM SAFEVAL

DIMENSION A(-100:100), B(100), C(100)

K = KFUN(A(I))

N = NFUN(B(I)) ! unknown at compile time

DO I = K, N ! N is vector length

A(I) = A(I) + (I-K)

END DO

END

SUBROUTINE RUNTIME

DIMENSION A(-100:100), B(100), C(100)

COMMON // J

DO I = 1, 100

C(I) = A(I-J)

A(I) = B(I)

END DO

END

Data dependency directives

Loops

DO I=1, N !N>>10

DO J=1,10

A(I,J) = B(I,J) * C(I,J)

END DO

END DO

DO J=1,10

DO I=1, N

A(I,J) = B(I,J) * C(I,J)

END DO

END DO

Loop optimizations

DO I=1,N

A(I) = B(I) + C(I)

END DO

DO I=2, N+1

D(I) = E(I) **2

END DO

Combined into

DO I=1,N

A(I) = B(I) + C(I)

D(I+1) = E(I+1) **2

END DO

DO I =1, N

DO J=1, 4

A(I,J) = B(I,J) * C(I,J)

END DO

END DO

DO I =1, N !Unrolled - vertically

A(I,1) = B(I,1) * C(I,1)

A(I,2) = B(I,2) * C(I,2)

A(I,3) = B(I,3) * C(I,3)

A(I,4) = B(I,4) * C(I,4) !etc

END DO

DO I =1, N

T = 0

DO J=1, 4

T = T + B(I,J)

END DO

A(I) = A(I) + T

END DO

DO I =1, N !Unrolled horizontally

A(I) = A(I) + B(I,1)+B(I,2)+B(I,3)+B(I,4)

END DO

Common optimization techniques

DO I=2, N

A(I) = A(I-1) * D(I)

B(I) = A(I) + C(I-1) **2

END DO

DO I=2, N ! Modified

A(I) = A(I-1) * D(I)

END DO

DO I=2, N

B(I) = A(I) + C(I-1) **2

END DO

DO J=1, M

S = BB

DO I=1, N

S = S*C

A(I) = A(I) + S

END DO

END DO

DO I=1, M

TV(I) = BB

END DO

DO I=1, N

DO J=1, M

TV(J) = TV(J) * C

A(I) = A(I) + TV(J)

END DO

END DO
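That replacing the scalar S by the temporary array TV and interchanging the loops leaves the results unchanged can be checked with a Python sketch. The function names and the sample values are assumptions for illustration.

```python
def original(a, bb, c, m):
    """Outer loop over J, inner loop over I, scalar S carried along I."""
    a = list(a)
    for _ in range(m):
        s = bb
        for i in range(len(a)):
            s = s * c
            a[i] = a[i] + s
    return a


def expanded(a, bb, c, m):
    """S replaced by the array TV; loops interchanged so J is innermost."""
    a = list(a)
    tv = [bb] * m
    for i in range(len(a)):
        for j in range(m):
            tv[j] = tv[j] * c
            a[i] = a[i] + tv[j]
    return a
```

For a = [1, 2, 3], BB = 2, C = 0.5 and M = 4 both versions return [5, 4, 4]. In the second version the inner loop no longer carries S from one iteration to the next, which is the dependence that blocked vectorization of the original inner loop.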

DO I=2, N

A(I-1) = A(I-S) * B(I) !CONDITIONAL

END DO

Modified

PARAMETER (S=1)

...

DO I=2, N

A(I-1) = A(I-S) * B(I) !VECTOR

END DO

DO I=1, N

IF (A(I) .EQ. 0) THEN

B(I) = AZERO !Rare

ELSEIF (A(I) .LT. 0) THEN

B(I) = ANEG !Rare

ELSE

B(I) = C(I) * D(I)/A(I)

ENDIF

END DO

DO I=1, N

IF (A(I) .GT. 0) THEN

B(I) = C(I) * D(I)/A(I) !Frequent

ELSEIF (A(I) .EQ. 0) THEN

B(I) = AZERO

ELSE

B(I) = ANEG

ENDIF

END DO

Compiler vectorization directives

VECTOR/NOVECTOR Turns vectorization on or off

NEXTSCALAR Disables vectorization for only the next DO loop that follows the directive within the program unit

IVDEP (SAFEVL = n) Indicates that any dependencies can be ignored if the vector length does not exceed n - only used when it is known that any apparent dependencies will not cause invalid results if a loop is vectorized

VFUNCTION Declares that a vector version of an external function exists

SHORTLOOP Allows the compiler to generate faster code when a loop's trip count is 64 or less

VSEARCH/NOVSEARCH Enables or disables vectorization of all search loops; the default, VSEARCH, remains in effect until a NOVSEARCH is encountered

RECURRENCE/NORECURRENCE Enables or disables vectorization of all reduction loops until the opposite directive or the end of the program unit is encountered


All documents are the responsibility of, and copyright, © their authors and do not represent the views of The Parallel Computer Centre, nor of The Queen's University of Belfast.
Maintained by Alan Rea, email A.Rea@qub.ac.uk