The Queen's University of Belfast
Parallel Computer Centre


5 Parallel Processing


5.1 Capabilities

Hardware Level

parallel instruction execution

vector registers and segmented vector functional units ie pipelining

I/O subsystems or foreground processors

Software Level

concurrent multiprogramming

multiprogramming at the job level

multiprogramming at the process level

multitasking

5.2 Evolution of CRI parallel processing software

This consisted of three implementations:

macrotasking - programmers had to modify their codes to make use of parallelism, ie scoping data and inserting CRI-specific library calls

microtasking - expanded on macrotasking, requiring less data scoping, with compiler directives replacing the library calls

autotasking - the most recent implementation, combining the best features of the previous two with other enhancements

5.3 Autotasking

Autotasking can be fully automatic, and it can exploit parallelism at the DO loop level; its analysis does not extend beyond subroutine boundaries.

The cf77 command serves as an `overcompiler' in that it invokes the appropriate phases of Autotasking (FPP, FMP, the CFT77 compiler and the loader) to build an executable program based on a set of defaults and options.
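A typical invocation might look like the following sketch; -Zp (full Autotasking) is assumed to be the relevant cf77 option on the local CF77 release, and -Wd passes options through to FPP as described in section 5.4:

      cf77 -Zp prog.f              # run FPP, FMP, CFT77 and the loader on prog.f
      cf77 -Zp -Wd"-e" prog.f      # as above, passing an option through to FPP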

5.3.1 Goals of autotasking

detect parallelism automatically in a program and exploit it without user intervention

define a syntax by which parallelism can be expressed, allowing users to guide the Autotasking system in code segments where they can provide additional information or where parallelism cannot be detected automatically

define the scope of variables when transforming a program to exploit parallelism

provide a simple command line interface to Autotasking



5.3.2 When to use autotasking

If a program is I/O bound, Autotasking will probably make it more so. Good candidates for Autotasking are long-running programs which use so much memory that little else can run in the machine, or programs which have hard deadlines.

Programs that are heavily vectorized tend to have a high potential for parallelism. High performance for many codes is achieved when the compiler detects code sequences that can be vectorized and uses the vector registers to run those sequences. It costs a program little to use the vector registers, so it is almost always better to run in vector mode.

When it comes to optimization Autotasking favours vectorization over parallel processing. Autotasking vectorizes the innermost loop of a nest of DO loops and runs the outermost on multiple processors, if dependence analysis permits. Autotasking may even process a single vectorizable DO loop in chunks as if it were a nested pair of loops with a vector inner loop and a parallel outer loop.
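A sketch of the kind of loop nest this applies to (the routine and array names are illustrative):

      SUBROUTINE SAXPYM(A, B, C, S, M, N)
C     Inner loop vectorizes; the independent outer iterations can be
C     distributed across processors if dependence analysis permits.
      REAL A(M,N), B(M,N), C(M,N), S
      DO 20 J = 1, N
         DO 10 I = 1, M
            A(I,J) = B(I,J) + S*C(I,J)
   10    CONTINUE
   20 CONTINUE
      RETURN
      END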

Loops do not have to be vectorized for Autotasking to detect that a nest of loops can be run in parallel.

5.3.3 Speedup

First consider how much parallelism the program contains: from a known or expected amount of parallelism you can calculate the maximum speedup using Amdahl's law for multitasking, ie

Sm = 1 / (fs + fp/N)

where

Sm = Maximum expected speedup from multitasking (wall-clock time not CPU time)

N = Number of processors available for parallel execution

fp = Fraction of a program that can execute in parallel

fs = Fraction of a program that is serial = 1 - fp
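For example, with illustrative values fp = 0.9 and N = 4:

Sm = 1 / (0.1 + 0.9/4) = 1 / 0.325 = 3.1 (approximately)

so even a 90% parallel program gains at most about a three-fold speedup on four processors.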

Autotasking may detect and exploit only part, or none, of this parallelism, and what it does find may not be of sufficient granularity to make the program run faster. Vectorization almost always makes codes run faster, whereas Autotasking generally produces speedups but carries a higher risk of slowing some codes down.

5.4 fpp

This is the dependence analysis phase of the CFT77 compiling system: it parses the original Fortran source program, looks for parallelism within the program units, and produces a transformed Fortran source file as output. Optimization switches passed through cf77 control the optimizations FPP performs, eg -Wd"-d" and -Wd"-e".

5.5 fmp

FMP is the translation phase and transforms the Fortran source file for multitasking ie it invokes Autotasking.

5.6 fpp loop selection criteria for autotasking

loop iteration count

presence of data dependence

amount of work done within the loop

5.6.1 fpp loop optimization techniques

Loop collapse - automatically collapsing nested loops into single loops with larger iteration counts.

Loop fusion - combining consecutive loops ie loops which have no statements between them.

Loop rerolling - recognizing loops whose iterations have been hand-unrolled into separate statements and rolling them back into a single loop.

Translation of array notation - translates array section syntax into DO loops which can be vectorized and/or autotasked.

Extended parallel regions - tries to combine or expand regions of parallelism to reduce Autotasking run-time overhead.

Parallel cases - when parallelism cannot be found within a loop nest FPP tries to find loops or loop nests that are completely independent of each other and execute them as parallel cases.

Reductions - FPP does not autotask loops containing dependencies between loop iterations, except for reduction operations on the elements such as summation, product, minimum, maximum, index of minimum or maximum, and dot product. Each task is given a partial reduction to perform and the partial results are combined as each task finishes.
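For example, a summation loop of the following form (names illustrative) carries a dependence on S between iterations, yet it matches the reduction pattern: each task can accumulate a partial sum which is combined when the tasks finish.

      REAL FUNCTION TOTAL(A, N)
C     Sum reduction: S depends on the previous iteration, but the
C     pattern can be partitioned into partial sums, one per task.
      REAL A(N)
      S = 0.0
      DO 10 I = 1, N
         S = S + A(I)
   10 CONTINUE
      TOTAL = S
      RETURN
      END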

5.6.2 Additional fpp optimization

Vectorization enhancement

data dependence analysis (subscript clarification, conditional vectorization, minimization of recursion, loop splitting, and loop peeling)

loop nest restructuring (IF conversion, loop fusion, loop unrolling)

Inline expansion

this produces performance benefits by expanding the bodies of certain subroutines and functions into the loops that call them, which allows the calling loop and the body of the called routine to be optimized together (see the sketch after this list)

Scalars in loops

scalar variables that are not modified in the loop do not prohibit optimization
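A sketch of the kind of call that benefits from inline expansion (routine and variable names are illustrative): once the body of SCALE has been expanded into the loop, the loop no longer contains a call and can be vectorized or autotasked.

      SUBROUTINE APPLY(A, B, C, N)
      REAL A(N), B(N), C
      DO 10 I = 1, N
C        small function call that inline expansion can remove
         A(I) = SCALE(B(I), C)
   10 CONTINUE
      RETURN
      END

      REAL FUNCTION SCALE(X, C)
C     trivial body, a good candidate for inline expansion
      SCALE = C*X
      RETURN
      END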

5.7 Autotasking performance

Scalar algorithms tend to dominate performance, in line with Amdahl's law, which states that overall performance is dominated by the slowest component. This law can be used to forecast performance for both vectorization and Autotasking. Vectorization usually decreases both CPU time and wall-clock time, whereas multitasking decreases only wall-clock time and may even increase CPU time because of the extra code required for starting, stopping and synchronizing processors.

Note - single-threaded code segments, which must use a single processor, exist in every program, so although some programs approach 100% parallelism virtually no code reaches it.

5.7.1 Estimating percentage parallelism within a program

METHOD 1

determine which subroutines consume the majority of program execution time, eg using Flowtrace

identify the subroutines that have significant sections of code that can be executed in parallel ie

heavily vectorized

performance is dominated by nested loops that do not contain CALL statements

performance is dominated by loops with calls to self-contained subroutines or routines which are expanded inline

nested loops where the innermost loop can be vectorized and these loops have a large number of both iterations and operations

loops that do the work of a matrix multiply, a first- or second-order recurrence, or a search for the index of a minimum or maximum element

add up the percentages of execution time (from the Flowtrace listing) for the subroutines identified as significantly parallel

METHOD 2 (A MORE ACCURATE ESTIMATE)

execute the program in single-CPU mode and use the PROF command, which shows which subroutines consume the most execution time

identify the significant sections, ie the loops, of those subroutines

add up the percentages of execution time reported by PROF for all identified sections of code

5.8 Prerequisites for high performance

Vectorization decreases both wall-clock and CPU time and decreases total job times even in a batch environment; it is also easier to write vectorizable code than parallel code.

It becomes increasingly difficult to vectorize a program beyond 70% to 80%; after this, Autotasking should be used in an attempt to decrease wall-clock time further. Autotasking does not look for parallelism beyond the scope of a subroutine or function, but inlining can eliminate subroutine calls and thus increase the possibility of identifying regions that can be run in parallel.

5.8.1 Parallelism and load balancing

There is a direct relationship between the two, as the extent of parallelism for any parallel region is determined by the number of partitions or `chunks' of independent work that constitute the region. Load balancing is the process of ensuring that the amount of work done by each processor available to the job is approximately equal.

The higher the extent of parallelism, the easier it is to balance the work evenly across the processors. Smaller-granularity parallelism is easier to balance across the available processors, but it generates more overhead than large-granularity parallelism because synchronization is required each time a `chunk' of work is allocated to a processor.

5.8.2 Overhead

This is the extra execution time created by the multiprocessing itself, ie

time spent waiting on semaphores - processors have to wait on semaphores for certain lengths of time while synchronizing

time spent executing extra code for Autotasking - slave processors are acquired upon entering the parallel region and at synchronization points within the parallel region thus incurring overhead. This could add 0% - 5% to the overall execution time

extra memory bank conflicts - created by inter- and intraprocessor memory references can degrade vector performance

possible decreased vector performance - due to using inner-loop autotasking ie shorter vector lengths and more vector loop startups.

5.8.3 Autotasking analysis tools

atexpert

mtdump

ftref

5.8.4 Memory usage

Several steps can be taken to decrease memory usage:

use Fortran SAVE statements for large local arrays - they then have static allocation and are no longer counted in the initial stack size computation (see the sketch after this list);

store large local arrays in COMMON - they again have static memory allocation;

specify explicit stack requirements for SEGLDR - multitasked code requires more stack space than unitasked code.
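A minimal sketch of the first two suggestions (the routine name and array size are illustrative):

      SUBROUTINE WORK1(N)
C     Large local array given static storage with SAVE, so it is no
C     longer counted in the initial stack-size computation; placing
C     TEMP in a COMMON block would have the same effect.
C     (Assumes N does not exceed the size of TEMP.)
      REAL TEMP(100000)
      SAVE TEMP
      DO 10 I = 1, N
         TEMP(I) = 0.0
   10 CONTINUE
      RETURN
      END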

MASTER AND SLAVE TASKS

When an Autotasked program is executing code that is not autotasked, the single executing task is called the master task. The master task executes all the serial code, initiates parallel processing when an Autotasking region is entered, performs all, part, or none of the work in the Autotasking region, and waits until the parallel processing is finished before exiting the Autotasking region.

Once Autotasking code is encountered, the master task calls an external function to bring other available slave processors into execution. The address of the code to be executed is passed to this function and each slave begins executing at this address. The slave code is in a separate subroutine created by FMP, and the variables that are shared with the master task and other slave tasks are passed as arguments to the slave subroutine.

The code executed by the master is distinct from the code executed by the slave task. The master is the original calling routine and contains the initialization and termination code for parallel execution which the slave does not. Also the master code contains a unitasked version of the autotasked code in case the initialization code determines that Autotasking is not appropriate.

5.9 Strategy for debugging autotasked code

If the code produces incorrect results, FPP may be making incorrect transformations. Some suggestions for isolating the problem are:

use atchop to do a binary search of concurrent regions to narrow the problem down to a suspect loop

run the code through FPP only, to determine whether the problem arises only when FMP is used

if the problem exists only when FMP is used, set NCPUS to 1; this can help isolate variable scoping problems or pinpoint the problem to either the slave or the master task

start with many FPP options and reduce these one at a time to see which transformation is causing the problems - also disable the default options to isolate the problem

use the CFPP$ SKIP directive to inhibit transformation of specific loops in the code.
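As a sketch of directive placement (assuming CFPP$ SKIP applies to the loop that immediately follows it; the routine and variables are illustrative):

      SUBROUTINE SUSPECT(A, B, N)
      REAL A(N), B(N)
C     inhibit FPP transformation of the loop that follows
CFPP$ SKIP
      DO 10 I = 2, N
         A(I) = A(I-1) + B(I)
   10 CONTINUE
      RETURN
      END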

5.10 Multitasking terminology

Multitasking/Parallel Processing - one program makes use of multiple processors to execute portions of the program simultaneously

Autotasking - automatic distribution of loop iterations to multiple processors (or tasks) using the cf77 compiler

Parallel region - section of code executed by multiple processors; can be classified as partitioned or redundant

Single-threaded code - section of code that is executed by only one processor at a time

Serial code - section of code that is executed by only one processor

Partitioned code - code within a parallel region in which multiple processors share the work to be done; each processor does a different portion of the work

Redundant code - code within a parallel region in which processors duplicate work whose results need to be available to all processors

Data dependency - occurs when a computation in one iteration of a loop requires a value computed in another iteration of the loop

Synchronization - the process of coordinating the steps within concurrent/parallel regions

Master task - the task that executes all of the serial code, initiates parallel processing, and waits until parallel processing is finished before leaving the Autotasking region

Slave task - a task initiated by the master task

Directives - special lines of code beginning with CDIR$, CDIR@, CMIC$, CMIC@, or CFPP$ that give the compiling system information about a program.
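As an illustration of directive syntax (a sketch only; CMIC$ DO ALL with SHARED and PRIVATE clauses is assumed here and should be checked against the Autotasking documentation), a loop whose iterations are independent can be marked for parallel execution:

      SUBROUTINE ADDV(A, B, N)
      REAL A(N), B(N)
CMIC$ DO ALL SHARED(A, B, N) PRIVATE(I)
      DO 10 I = 1, N
         A(I) = A(I) + B(I)
   10 CONTINUE
      RETURN
      END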

