The Queen's University of Belfast

Parallel Computer Centre
Parallel Processing
Hardware Level
- parallel instruction execution
- instructions are executed in parallel
- eg an addition may be issued during a multiplication
- vector registers and segmented vector functional units ie pipelining
- an instruction may begin before the previous one has completed
- I/O subsystems or foreground processors
- logically separate processors that perform I/O functions for the operating system in use on the computer system
- operations occur in parallel with a job or processor running in the main processor
Software Level
- concurrent multiprogramming
- one processor which switches between jobs/processes so things appear to happen simultaneously
- the processor can work on one job while another job is waiting for an I/O operation
- multiprogramming at the job level
- more than one processor means working on as many programs as there are processors
- multiprogramming at process level
- each program submitted is a process
- can create separate processes by placing them in the background or piping output to another process
- multitasking
- more than one processor works on one program
- Evolution of CRI parallel processing software consisted of three implementations:
- macrotasking - programmers had to modify their codes to make use of parallelism, ie by scoping data and inserting CRI-specific library calls
- microtasking - expanded macrotasking, requiring less data scoping, with compiler directives replacing library calls
- autotasking - is the most recent implementation and combines the best features of the previous two with other enhancements
Autotasking
- The cf77 command serves as an `overcompiler', invoking the appropriate phases of Autotasking, namely FPP, FMP, the CFT77 compiler and the loader, to build an executable program based on a set of defaults and options.
- Goals of autotasking
- detect parallelism automatically in a program and exploit it without user intervention
- define a syntax by which parallelism is expressed, so that users can guide the Autotasking system in code segments where they can provide additional information or where parallelism cannot be detected automatically (see the sketch after this list)
- define the scope of variables when transforming a program to exploit parallelism
- provide a simple command line interface to Autotasking
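- For illustration, a user-guided parallel loop might look like the following sketch; the CMIC$ prefix is one of the directive prefixes listed at the end of this document, but the DO ALL form and its SHARED/PRIVATE clauses are assumptions here and should be checked against the CF77 documentation:

C     Hypothetical example - the DO ALL directive form and its
C     SHARED/PRIVATE clauses are assumed, not verified
CMIC$ DO ALL SHARED(A, B, C, N) PRIVATE(I)
      DO 10 I = 1, N
         A(I) = B(I) + C(I)
   10 CONTINUE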

- When to use autotasking
- good candidates for Autotasking are long-running programs that use so much memory that little else can run on the machine, or programs that have hard deadlines
- Programs that are heavily vectorized tend to have a high potential for parallelism. It costs a program little to use the vector registers, hence it is almost always better to run in vector mode.
- Autotasking favours vectorization over parallel processing. Autotasking vectorizes the innermost loop of a nest of DO loops and runs the outermost on multiple processors, if dependence analysis permits (a sketch follows this list).
- Loops do not have to be vectorized for Autotasking to detect that a nest of loops can be run in parallel.
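- As a sketch of this behaviour (an assumed, illustrative loop nest, not taken from a real code): with no data dependence between iterations, the outer J loop can be run on multiple processors while the inner I loop is vectorized.

C     Hypothetical loop nest - the inner I loop is vectorized and the
C     independent outer J iterations are spread across processors
      DO 20 J = 1, M
         DO 10 I = 1, N
            A(I,J) = B(I,J)*S + C(I,J)
   10    CONTINUE
   20 CONTINUE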
- Speedup
- calculate the expected speedup using Amdahl's law for multitasking, ie (a worked example follows this list)

Sm = 1 / (fs + fp/N)

Sm = Maximum expected speedup from multitasking (wall-clock time, not CPU time)
N = Number of processors available for parallel execution
fp = Fraction of a program that can execute in parallel
fs = Fraction of a program that is serial = 1 - fp
- Vectorization almost always results in codes running faster, whereas Autotasking generally results in speedups but carries a higher risk of slowing down some codes.
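- Using the formula above as a purely illustrative worked example (the figures are assumed, not measured): with fp = 0.9, fs = 0.1 and N = 4, Sm = 1 / (0.1 + 0.9/4) = 1 / 0.325, which is approximately 3.1 - so even a 90% parallel program falls well short of a four-fold speedup on four processors.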
- fpp - dependence analysis phase
- parses the original Fortran source program
- looks for parallelism within the program units
- produces a transformed Fortran source file
- Optimization switches control the optimization performed by FPP, eg -Wd"-d" and -Wd"-e"
- fmp - translation phase
- transforms the Fortran source file for multitasking ie it invokes Autotasking.
- fpp loop selection criteria for autotasking
- loop iteration count
- presence of data dependence
- amount of work done within the loop
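- For example (an assumed, illustrative loop), a first-order recurrence fails the data dependence criterion because each iteration needs the result of the previous one, so FPP will not autotask it:

C     Hypothetical recurrence - A(I) depends on A(I-1), so the
C     iterations cannot safely be run on different processors
      DO 10 I = 2, N
         A(I) = A(I-1)*S + B(I)
   10 CONTINUE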
- fpp loop optimization techniques
- Loop collapse - collapsing nested loops into single loops with larger iteration counts
- Loop fusion - combining consecutive loops
- Loop rerolling - recombining iterations of an inner loop that have been written out as separate statements back into a loop
- Translation of array notation - into DO loops that can be vectorized and/or autotasked
- Extended parallel regions - combines or expands parallel regions reducing Autotasking run-time overhead.
- Parallel cases - if parallelism is not found within a loop nest, FPP tries to find loops or loop nests which are completely independent of each other and execute them as parallel cases.
- Reductions - loops containing dependencies are not autotasked, except for reduction operations on the elements such as sum, product, min, max, index of min or max, and dot product.
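- A sum reduction is a typical case (an assumed, illustrative loop): although S is updated on every iteration, the pattern is recognised and the loop can still be autotasked, with the partial sums from each processor combined at the end.

C     Hypothetical sum reduction - recognised despite the apparent
C     dependence on S; each processor forms a partial sum
      S = 0.0
      DO 10 I = 1, N
         S = S + A(I)
   10 CONTINUE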
- Additional fpp optimization
- Vectorization enhancement - data dependence analysis (subscript clarification, conditional vectorization, minimization of recursion, loop splitting, and loop peeling) and loop nest restructuring (IF conversion, loop fusion, loop unrolling)
- Inline expansion - expansion of the bodies of certain subroutines and functions into the loops that call them which allows the calling loop and the body of the called routine to be optimized
- Scalars in loops - scalar variables that are not modified in the loop do not prohibit optimization
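- For example (an assumed, illustrative loop), the scalar S below is read but never modified inside the loop, so it does not prevent the loop from being vectorized or autotasked:

C     Hypothetical loop - the scalar S is read but never modified, so
C     it does not prohibit optimization of the loop
      DO 10 I = 1, N
         A(I) = S*B(I) + C(I)
   10 CONTINUE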
- Autotasking performance
- Scalar algorithms tend to dominate performance
- Vectorization usually decreases both CPU time and wall-clock time, whereas multitasking decreases only wall-clock time and may even increase CPU time due to the extra code required for starting, stopping and synchronising processors.
Estimating percentage parallelism within a program
Method 1
- determine which subroutines consume the majority of program execution time, eg using Flowtrace
- identify which of those subroutines have significant sections of code that can be executed in parallel, ie
- code that is heavily vectorized, performance dominated by nested loops that do not contain CALL statements, calls only to self-contained subroutines or to routines that are expanded inline, nested loops where the innermost loop can be vectorized, etc
- add the total percentage of execution time (from the Flowtrace list) for the subroutines identified as significantly parallel
Method 2 (a more accurate estimate)
- execute the program in single-CPU mode and use the PROF command, which shows which subroutines consume the most execution time
- identify the significant sections, ie the loops, of those subroutines
- add the percentages of execution time reported by PROF for all identified sections of code
Prerequisites for high performance
- Vectorization decreases both wall-clock and CPU time and decreases total job times even in a batch environment, and it is easier to write vectorizable code than parallel code.
- It becomes increasingly difficult to vectorize a program beyond 70% to 80%; beyond this point Autotasking should be used in an attempt to further decrease wall-clock time.
- Autotasking does not look for parallelism beyond the scope of a subroutine or function, but inlining can eliminate subroutine calls and thus increase the possibility of identifying areas that can be run in parallel.
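- As a sketch (the routine name and loop are hypothetical): the CALL inside the outer loop normally hides the work from Autotasking, but if the called routine is expanded inline the whole loop body becomes visible to the dependence analysis and the outer loop may then be run in parallel.

C     Hypothetical example - COLADD is a made-up routine; if it is
C     expanded inline, the J loop becomes a candidate for Autotasking
      DO 10 J = 1, M
         CALL COLADD(N, S, A(1,J), B(1,J))
   10 CONTINUE

      SUBROUTINE COLADD(N, S, X, Y)
      REAL S, X(N), Y(N)
      DO 20 I = 1, N
         Y(I) = Y(I) + S*X(I)
   20 CONTINUE
      RETURN
      END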
Parallelism and load balancing
- The extent of parallelism for any parallel region is determined by the number of partitions or `chunks' of independent work that constitute the region.
- Load balancing is the process to ensure the amount of work done by each processor available to the job is approx equal.
- The higher the extent of parallelism the easier it is to balance the work evenly across the processors.
- Smaller-granularity parallelism is easier to balance across the available processors but it generates more overhead than large-granularity parallelism, as synchronization is required each time a `chunk' of work is allocated to a processor.
Overhead
- This is the extra execution time incurred by the multiprocessing itself, ie
- time spent waiting on semaphores - processors have to wait on semaphores for certain lengths of time while synchronizing
- time spent executing extra code for Autotasking - slave processors are acquired upon entering a parallel region and at synchronization points within the parallel region, thus incurring overhead; this can add 0% - 5% to the overall execution time
- extra memory bank conflicts - created by inter- and intraprocessor memory references can degrade vector performance
- possible decreased vector performance - due to using inner-loop autotasking ie shorter vector lengths and more vector loop startups.
Memory usage
- Several steps can be taken to decrease memory usage:
- use Fortran SAVE statements for large local arrays - they then have static allocation and are no longer used in the initial stack size computation (see the sketch after this list)
- store large local arrays in COMMON - they again have static allocation
- specify explicit stack requirements for SEGLDR - multitasked code requires more stack space than unitasked code.
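- A sketch of the first two options (the array name and size are hypothetical): the SAVE statement, or placing the array in COMMON, gives the large local array static storage so that it no longer contributes to the stack size of each task.

      SUBROUTINE WORK1(N)
C     Hypothetical routine - the SAVE statement gives BIG static
C     allocation, keeping it out of the per-task stack
      REAL BIG(100000)
      SAVE BIG
      DO 10 I = 1, MIN(N, 100000)
         BIG(I) = 0.0
   10 CONTINUE
      RETURN
      END
C     Placing BIG in a COMMON block, eg COMMON /BIGBLK/ BIG, achieves
C     the same static allocation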
Master and slave tasks
- Master task - the task in an Autotasked program that executes the code which is not autotasked, ie it
- executes all the serial code,
- initiates parallel processing when an Autotasking region is entered,
- performs all, part or none of the work in the Autotasking region and waits until the parallel processing is finished before exiting the Autotasking region.
- Slave task -
- master task calls an external function to bring other available slave processors into execution.
- The address of the code to be executed is passed to this function and each slave begins executing at this address.
- The slave code is in a separate subroutine created by FMP and the variables which are shared with the master task and other slave tasks are passed as arguments to the slave subroutine.
Multitasking terminology
Multitasking/Parallel Processing - one program makes use of multiple processors to execute portions of the program simultaneously
Autotasking - automatic distribution of loop iterations to multiple processors (or tasks) using the cf77 compiler
Parallel region - section of code executed by multiple processors; can be classified as partitioned or redundant
Single-threaded code - section of code that is executed by only one processor at a time
Serial code - section of code that is executed by only one processor
Partitioned code - code within a parallel region in which multiple processors share the work that needs to be done; each processor does a different portion of the work
Redundant code - code within a parallel region in which processors duplicate work whose results need to be available to all processors
Data dependency - when a computation in one iteration of a loop requires a value computed in another iteration of the loop
Synchronization - process of coordinating the steps within concurrent/parallel regions
Master task - task that executes all of the serial code, initiates parallel processing, and waits until parallel processing is finished before leaving the Autotasking region
Slave task - task initiated by the master task
Directives - special lines of code beginning with CDIR$, CDIR@, CMIC$, CMIC@, or CFPP$ that give the compiling system information about a program
All documents are the responsibility of, and copyright, © their authors and do not represent the views of The Parallel Computer Centre, nor of The Queen's University of Belfast.
Maintained by Alan Rea, email A.Rea@qub.ac.uk