parallel instruction execution
vector registers and segmented vector functional units ie pipelining
I/O subsystems or foreground processors
Software Level
concurrent multiprogramming
multiprogramming at the job level
multiprogramming at process level
multitasking
macrotasking - programmers had to modify their codes to make use of parallelism, ie by scoping data and inserting library calls specific to CRI
microtasking - expanded macrotasking, requiring less data scoping; compiler directives replaced the library calls
autotasking - the most recent implementation; it combines the best features of the previous two with other enhancements
The cf77 command serves as an `overcompiler' in that it invokes the appropriate phases of Autotasking, namely - FPP, FMP, the CFT77 compiler and the loader (as illustrated in the following diagram), to build an executable program based on a set of defaults and options.
define a syntax by which parallelism is expressed, allowing users to guide the Autotasking system in code segments where they can supply additional information or where the Autotasking system cannot detect parallelism automatically (see the sketch after this list)
define the scope of variables when a program is transformed to exploit parallelism
provide a simple command line interface to Autotasking
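As a minimal sketch of such user guidance (the array names A and B and the bound N are hypothetical), a CMIC$ DO ALL directive marks a loop whose iterations may be distributed across the processors, with SHARED and PRIVATE clauses expressing the scope of each variable:

CMIC$ DO ALL SHARED(A, B, N) PRIVATE(I)
      DO 10 I = 1, N
         A(I) = A(I) + B(I)
   10 CONTINUE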
Programs that are heavily vectorized tend to have a high potential for parallelism. High performance for many codes is achieved when the compiler detects code sequences that can be vectorized and uses the vector registers to run those sequences. It costs a program little to use the vector registers, so it is almost always better to run in vector mode.
When it comes to optimization, Autotasking favours vectorization over parallel processing. Autotasking vectorizes the innermost loop of a nest of DO loops and runs the outermost on multiple processors, if dependence analysis permits. Autotasking may even process a single vectorizable DO loop in chunks, as if it were a nested pair of loops with a vector inner loop and a parallel outer loop.
Loops do not have to be vectorized for Autotasking to detect that a nest of loops can be run in parallel.
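For example (a sketch with hypothetical names), in a nest such as the one below the inner loop over J is a candidate for vectorization, while iterations of the outer loop over I can be spread across multiple processors if dependence analysis permits:

      DO 20 I = 1, N
         DO 10 J = 1, M
            C(J,I) = A(J,I) + S*B(J,I)
   10    CONTINUE
   20 CONTINUE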
Sm = Maximum expected speedup from multitasking (wall-clock time not CPU time)
N = Number of processors available for parallel execution
fp = Fraction of a program that can execute in parallel
fs = Fraction of a program that is serial = 1 - fp
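With these definitions, the speedup bound takes the familiar Amdahl's-law form:

    Sm = 1 / (fs + fp/N)

For example (illustrative numbers only), with fp = 0.9 and N = 4, Sm = 1 / (0.1 + 0.9/4) = 1 / 0.325, or roughly 3.1.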
Autotasking may detect and exploit only part of the parallelism, or none of it, ie the parallelism found may not be of sufficient granularity to make the program run faster. Vectorization almost always makes codes run faster, whereas Autotasking generally results in speedups but carries a higher risk of slowing some codes down.
eg -Wd"-d" and -Wd"-e"
presence of data dependence (see the sketch after this list)
amount of work done within the loop
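To illustrate the first point (hypothetical names), the loop below carries a dependence between iterations, since X(I) requires X(I-1) from the previous pass, so its iterations cannot safely be distributed across processors:

      DO 10 I = 2, N
         X(I) = X(I-1) + Y(I)
   10 CONTINUE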
Loop fusion - combining consecutive loops ie loops which have no statements between them.
Loop rerolling - recombining inner-loop iterations that have been unrolled into separate statements back into a single loop
Translation of array notation - translates array section syntax into DO loops which can be vectorized and/or autotasked.
Extended parallel regions - tries to combine or expand regions of parallelism to reduce Autotasking run-time overhead.
Parallel cases - when parallelism cannot be found within a loop nest FPP tries to find loops or loop nests that are completely independent of each other and execute them as parallel cases.
Reductions - FPP doesn't autotask loops containing dependencies between loop iterations, except for reduction operations on the elements, eg summation, product, minimum, maximum, index of minimum or maximum, and dot product. Each task is given a partial reduction to perform, and the partial results are combined as each task finishes.
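For instance (hypothetical names), a dot-product loop such as the following can still be autotasked as a reduction: each task forms a partial sum over its share of the iterations, and the partial sums are combined as the tasks finish.

      S = 0.0
      DO 10 I = 1, N
         S = S + A(I)*B(I)
   10 CONTINUE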
data dependence analysis (subscript clarification, conditional vectorization, minimization of recursion, loop splitting, and loop peeling)
loop nest restructuring (IF conversion, loop fusion, loop unrolling)
Inline expansion
this produces performance benefits by expanding the bodies of certain subroutines and functions into the loops that call them, which allows the calling loop and the body of the called routine to be optimized together
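A sketch of the idea (the function SAXPYF and its arguments are hypothetical): the function reference in the loop below would otherwise block optimization, but once the body of SAXPYF is expanded inline the whole loop can be vectorized and/or autotasked.

C     calling loop; the reference to SAXPYF inhibits optimization
      DO 10 I = 1, N
         Y(I) = SAXPYF(A, X(I), Y(I))
   10 CONTINUE

C     small routine whose body can be expanded into the loop above
      REAL FUNCTION SAXPYF(A, X, Y)
      SAXPYF = A*X + Y
      END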
Scalars in loops
scalar variables that are not modified in the loop do not prohibit optimization
Note - every program contains single-threaded code segments that must use a single processor, so although some programs approach 100% parallelism, virtually no code reaches it.
determine which subroutines consume the majority of program execution time, eg using Flowtrace.
identify the subroutines that have significant sections of code that can be executed in parallel ie
heavily vectorized
performance is dominated by nested loops that do not contain CALL statements
performance is dominated by loops with calls to self-contained subroutines or routines which are expanded inline
nested loops where the innermost loop can be vectorized and these loops have a large number of both iterations and operations
programs with loops that do the work of a matrix multiply, a first or second-order recurrence or a search for the index of a minimum or maximum element
add up the percentages of execution time (from the Flowtrace listing) for the subroutines that have been identified as significantly parallel
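For illustration (hypothetical numbers), if the Flowtrace listing shows two significantly parallel subroutines accounting for 55% and 25% of execution time, then fp is roughly 0.80, and with N = 4 the expected upper bound is Sm = 1 / (0.20 + 0.80/4) = 1 / 0.40 = 2.5.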
METHOD 2 (A MORE ACCURATE ESTIMATE)
execute the program in single-CPU mode and profile it, ie with the PROF command, which shows which subroutines consume the most execution time
identify the significant sections, ie loops, of those subroutines
add up the percentages of execution time from PROF for all identified sections of code
It becomes increasingly difficult to vectorize a program beyond 70% to 80%; past this point Autotasking should be utilised in an attempt to further decrease wall-clock time. Autotasking does not look for parallelism beyond the scope of the subroutine or function, but inlining can eliminate subroutine calls and thus increase the possibility of identifying areas that can be run in parallel.
The greater the extent of parallelism, the easier it is to balance the work evenly across the processors. Smaller-granularity parallelism is easier to balance across the available processors, but it generates more overhead than large-granularity parallelism because synchronization is required each time a `chunk' of work is allocated to a processor.
time spent waiting on semaphores - processors have to wait on semaphores for certain lengths of time while synchronizing
time spent executing extra code for Autotasking - slave processors are acquired upon entering a parallel region and at synchronization points within the parallel region, thus incurring overhead; this can add 0% to 5% to the overall execution time
extra memory bank conflicts - created by inter- and intraprocessor memory references can degrade vector performance
possible decreased vector performance - due to using inner-loop autotasking ie shorter vector lengths and more vector loop startups.
mtdump
ftref
use Fortran SAVE statements for large local arrays so that they have static allocation and are no longer included in the initial stack size computation (see the sketch after this list);
store large local arrays in COMMON - again giving them static allocation;
specify explicit stack requirements for SEGLDR - multitasked code requires more stack space than unitasked code.
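A sketch of the first two suggestions (all names hypothetical): the large local array WORKA gets static storage through SAVE, and WORKB lives in COMMON, so neither contributes to the per-task stack requirement.

      SUBROUTINE CRUNCH(N)
      REAL WORKA(100000)
      SAVE WORKA
      COMMON /BIGBUF/ WORKB(100000)
C     ... computation using WORKA and WORKB ...
      END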
MASTER AND SLAVE TASKS
In an Autotasked program, the task executing code that is not autotasked is called the master task. The master task executes all the serial code, initiates parallel processing when an Autotasking region is entered, performs all, part, or none of the work in the Autotasking region, and waits until the parallel processing is finished before exiting the Autotasking region.
Once Autotasking code is encountered, the master task calls an external function to bring other available slave processors into execution. The address of the code to be executed is passed to this function, and each slave begins executing at that address. The slave code is in a separate subroutine created by FMP, and the variables that are shared with the master task and other slave tasks are passed as arguments to the slave subroutine.
The code executed by the master is distinct from the code executed by the slave tasks. The master is the original calling routine and contains the initialization and termination code for parallel execution, which the slaves do not. The master code also contains a unitasked version of the autotasked code, in case the initialization code determines that Autotasking is not appropriate.
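Conceptually (this is only an illustrative sketch, not actual FMP output), the slave side resembles a compiler-generated subroutine whose arguments are the shared variables; the bookkeeping that hands each task its portion of the iterations is only suggested here.

C     illustrative sketch only, not real FMP-generated code
      SUBROUTINE SLAVE1(A, B, ILO, IHI)
      REAL A(*), B(*)
C     each slave works on its assigned range of iterations ILO..IHI
      DO 10 I = ILO, IHI
         A(I) = A(I) + B(I)
   10 CONTINUE
      END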
use atchop to do a binary search of concurrent regions to narrow the problem down to a suspect loop
run code only through FPP to determine if the problem only arises when FMP is used
if the problem does not occur when FMP is skipped, set NCPUS to 1; this can help isolate variable scoping problems or pinpoint the problem to either the slave or the master task
start with many FPP options and reduce these one at a time to see which transformation is causing the problems - also disable the default options to isolate the problem
use the CFPP$ SKIP directive to inhibit transformation of specific loops in the code.
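For instance (hypothetical names, and assuming the directive applies to the loop that immediately follows it), the transformation of a single suspect loop can be inhibited like this:

CFPP$ SKIP
      DO 10 I = 1, N
         D(I) = D(I) + E(I)*F(I)
   10 CONTINUE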
Multitasking - one program makes use of multiple processors to execute portions of the program simultaneously
Autotasking - automatic distribution of loop iterations to multiple processors (or tasks) using the cf77 compiler
Parallel region - section of code executed by multiple processors; can be classified as partitioned or redundant
Single-threaded code - section of code that is executed by only one processor at a time
Serial code - section of code that is executed by only one processor
Partitioned code - code within a parallel region in which multiple processors share the work that needs to be done; each processor does a different portion of the work
Redundant code - code within a parallel region in which each processor duplicates work whose results need to be available to all processors
Data dependency - when a computation in one iteration of a loop requires a value computed in another iteration of the loop
Synchronization - process of coordinating the steps within concurrent/parallel regions
Master task - task that executes all of the serial code, initiates parallel processing, and waits until parallel processing is finished before leaving the Autotasking region
Slave task - task initiated by the master task
Directives - special lines of code beginning with CDIR$, CDIR@, CMIC$, CMIC@, or CFPP$ that give the compiling system information about a program.