The Queen's University of Belfast

Parallel Computer Centre
Performance Utilities
`An Overview of the cf77 compiling system'

- FPP dependence analyser
- analyses and transforms source code to maximise use of hardware eg enhances vectorization, automatically detects and exploits parallelism
- FMP autotasking translation phase
- translates Autotasking directives for cft77
- CFT77 actual compiler
- compiles Fortran source code into relocatable object code
- SEGLDR
- converts relocatable object code into an executable binary file
Compiler Commands
syntax: cf77 [options] sourcefile.f
options - specify the invocation of various components
cf77 -Z invokes the preprocessor and the mid processor in the following ways:
-Zp - invokes both fpp and fmp
-Zc - bypasses both fpp and fmp (default option)
-Zv - all phases except fmp
-Zu - all phases except fpp
Example sequence of events invoked by the following command:
cf77 -Zp prog.f
fpp prog.f > prog.m
fmp prog.m > prog.j
cft77 prog.j
rm prog.m prog.j
segldr prog.o
rm prog.o
Whole system cf77 command
cf77 -Zp -Wd"fpp options" -Wu"fmp options" -Wf"cft77 options" prog.f
cf77 -Zp -o prog compiling system options
-Wd"-ei" dependency analyser options
-Wu"-x" translator options
-Wf"-em" compiler options
-Wl"-i dirs" loader options
file.f file1.f input filenames
Files within the compiling system
*.f = source
*.l = listing
*.s = CAL ie Cray Assembly Language
*.o = compiled code
*.m = code containing Autotasking/microtasking directives
*.j = Fortran code previously processed by fmp
*.a = library files
*.F = code to be processed by Generic preprocessor GPP
Compiler features
- INLINING - Inline code expansion
- Explicit or automatic.
- Subprogram is incorporated (within the resulting binary program) into the calling program unit where the call occurred.
- Eliminates the calling overhead and allows vectorization of loops which would otherwise be prevented from vectorizing due to the external call.
- A problem with inlining is that it may cause an unacceptable increase in the size of the program.
- CROSS-COMPILING
- This is compiling a program on one system to execute on another.
- The target system is specified by cpu and hdw arguments either on the compiler line (using -C) or directly to the operating system (using the TARGET environment variable).
Compiler directives
- Compiler directives within source code
- specify actions to be performed by the compiler - they are not Fortran code
- Categories of compiler directives
- vectorization control, scalar optimization control, listable output control, localised use of features specified by command line options and storage specifications.
- FPP and FMP generated directives
- these begin with CDIR@ and are used internally by the compiler, whereas user-supplied directives begin with CDIR$.
- Directives in the source program
- apply only to the program unit in which they appear
- directives used on the command line apply to the entire compilation. Generally the CDIR$ directives for features override the command-line option for the same feature.
Loop unwinding and unrolling
- Both of these optimization techniques are performed automatically by the compiler and can reduce loop overhead for eligible loops.
- Unwinding makes several copies of the body of a loop, resulting in straight-line code which is then vectorizable.
- Unrolling creates a new version of the loop at the scheduling level, not in the source, and does not remove the original. The new copy consists of n copies of the loop's computations, ie n iterations per pass. This reduces loop overhead and improves scheduling and register assignment in the new unrolled loop.
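As an illustration, here is a hand-written sketch of unrolling by a factor of four, expressed in source terms (the compiler does this at the scheduling level; the array names are hypothetical, and a real compiler adds cleanup code for trip counts not divisible by four):

```fortran
C     Original loop: one add per iteration, plus loop overhead
      DO 10 I = 1, N
        A(I) = A(I) + B(I)
 10   CONTINUE

C     Unrolled by 4: four adds per pass, one quarter of the
C     loop overhead (sketch assumes N is a multiple of 4)
      DO 20 I = 1, N, 4
        A(I)   = A(I)   + B(I)
        A(I+1) = A(I+1) + B(I+1)
        A(I+2) = A(I+2) + B(I+2)
        A(I+3) = A(I+3) + B(I+3)
 20   CONTINUE
```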
Optimization Strategies
- Preliminary considerations
- is optimization worthwhile/needed
- does the program use significant resources and which ones
- is the program destined for frequent long-term use
- how can one get the maximum return (execution speedup) on investment (your time)
- learning experience to apply to development phase of future programming
- Step 0 Define a baseline
- gather execution performance statistics for properly executing code; the job accounting and procstat utilities give information which should be retained and compared with the results obtained after optimization efforts
- maintain separately a working version of the program
- do not begin optimization until properly executing code has been obtained
- operate on a computationally short model which is representative of the whole
- Step 1 Determine resource target for optimization
- use job accounting, ja, and procstat
- determine if the program is I/O bound
- determine if the program is vectorized
- determine if the program efficiently utilises memory
- compare loopmark listings - run preprocessor to determine additional speedup eg cf77 -Zv -Wf"-em" *.f
- verify the results produced by fpp
- Step 2 Target subroutines for optimization
- loopmark, prof, perftrace
- check loopmark for vectorization information - as it is often the single most effective method of reducing overall program execution time
- Step 3 Determine target loops for vectorization
- run prof
- look for high percentage loops or regions
- look for high percentage regions which are not included in a vectorized inner loop
- look for high percentage regions which are vectorized but include complex index calculations and/or divisions
- look for high percentage regions which are vectorized, but include IF statements, and/or GOTO branches
- Step 4 Recall vector inhibitors
- I/O statements
- CALL, RETURN, STOP, or PAUSE statements
- function references
- branching statements (IF, GOTO)
- data dependencies and recurrences
- ambiguous subscript references
- long loops - could use the aggress option of -Wf but must verify results eg cf77 -Wf"-o aggress" *.f
- Step 5 Other Optimization Techniques
- unroll small loops
- inline often called subroutines
- promote scalars to vectors
- use parameter statements to define constants especially those which define loop lengths
- switch loop dimensions when appropriate
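For example, switching loop dimensions can turn stride-N memory references into stride-1 references, because Fortran stores arrays in column order. A sketch with hypothetical names:

```fortran
C     Inner loop runs over the SECOND subscript, so successive
C     references are N words apart (stride N) -- poor for memory
      DO 40 I = 1, N
        DO 30 J = 1, N
          A(I,J) = S * A(I,J)
 30     CONTINUE
 40   CONTINUE

C     Dimensions switched: inner loop runs over the FIRST
C     subscript, giving contiguous (stride-1) vector references
      DO 60 J = 1, N
        DO 50 I = 1, N
          A(I,J) = S * A(I,J)
 50     CONTINUE
 60   CONTINUE
```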
- Step 6 Other Information
- regularly run the performance tools since an optimized routine can cause another to become important
- perform optimization comparisons on a short run
Optimising Fortran Programs - Code Analysis

Types of optimization tools
- Static Analysis Tools - give compile time information
- cross reference listings (cf77 -Wf"-ex" prog.f creates the prog.l file)
- program listings and utilities - Loopmark listing and ftref
- Dynamic Analysis Tools - give run-time information
- ja - job accounting
- flowtrace - subroutine level analysis of the program giving the dynamic calling tree
- prof and profview - use a fixed time interval for sampling so resulting statistics involve probability
- procstat and procrpt
Optimization Techniques
- Job Accounting
- shows what happens during code execution.
- Incurs no CPU overhead. Gives useful information such as a timestamp of the beginning and end of runtime, total elapsed time, CPU time used, and time spent in I/O operations and memory accesses.
- If NCPUS is greater than 1 (default 3 or 4) and autotasking is turned on (-Zp flag to cf77), ja reports on the tasking breakdown, showing how many CPUs were used and for how long.
- Use the following:
ja
a.out
ja -clst > acctfile
- cl = command summary, s = summary, t = terminate
- Job Accounting - on the basis of this information decide which area of the program to concentrate on first.
- If I/O or memory activity is high, the next step should be to investigate this further with procview.
- If the code is CPU-bound, it's probably best to move on to profview and flowview, to further examine CPU activity.
Example - Job Accounting
Job Accounting - Summary Report
===============================
Job Accounting File Name : /tmp/nqs.+++++03av/.jacct4001
Operating System : sn5227 sn5227 7.0.4.2 roo.5 CRAY Y-MP
User Name (ID) : ccg0002 (1474)
Group Name (ID) : cc (901)
Account Name (ID) : centre (13)
Job Name (ID) : scrip (4001)
Report Starts : 10/11/94 13:40:11
Report Ends : 10/11/94 13:40:18
Elapsed Time : 7 Seconds
User CPU Time : 12.4072 [ 12.4067] Seconds
Multitasking Breakdown
(Concurrent CPUs * Connect seconds = CPU seconds)
--------------- --------------- -----------
1 * 0.1625 = 0.1625
2 * 1.9370 = 3.8740
3 * 2.4148 = 7.2445
(Concurrent CPUs * Connect seconds = CPU seconds)
(Avg.) (total) (total)
-------------------------------------------
2.50 * 4.5143 = 11.2809
System CPU Time : 0.7885 Seconds
I/O Wait Time (Locked) : 1.1013 Seconds
I/O Wait Time (Unlocked) : 0.2744 Seconds
CPU Time Memory Integral : 0.8239 Mword-seconds
SDS Time Memory Integral : 0.0000 Mword-seconds
I/O Wait Time Memory Integral : 0.1312 Mword-seconds
Data Transferred : 0.2458 MWords
Maximum memory used : 0.1406 MWords
Logical I/O Requests : 122
Physical I/O Requests : 97
Number of Commands : 2
Billing Units : 00
Optimization Techniques
- Procview
- produces statistics about I/O activity, process activity and memory activity for programs that use the standard UNICOS libraries.
- Use the following commands:
setenv NCPUS 1
procstat -R proc.raw a.out
procview -L -Sn proc.raw > proc.report
- -R = writes rawfile for procview to create reports, -L = prevents execution of interactive X Windows, -Sn = generates one or more system report files sorted by name
- autotasking is not allowed (NCPUS=1).
- Recompilation and reloading are not necessary
- There is very little CPU overhead
- procview report shows,
- how many bytes were read and written in each file
- the rate at which they were processed.
- With this knowledge, we can use the assign statement (explained later) to improve the efficiency of I/O activity.
Example - Procstat
PROCSTAT FILE REPORT
Showing All User Files That Moved Data
(Sorted by File Name)
(Process ID is: 94616)
I/O Max File Bytes Avg I/O Rate
Filename Mode Size Processed (Megabytes/Sec)
-------------------------------------------------------
stderr WO 0 37 0.012
stdout WO 0 2649 0.012
===============================================
Total Files = 2 0 2786 0.012
Profview
- A statistical analysis of program execution. The program address register is sampled at regular intervals, and a system routine records the address of the instruction currently being executed.
cf77 -Wf"-ez" -l prof prog.f
setenv PROF_WPB 1
a.out
prof -x a.out > prof.raw
profview prof.raw OR
profview -LmhDc prof.raw > prof.report
- requires recompilation and reloading, and uses extra memory. CPU overhead is negligible.
- reliant on the laws of probability and may be inaccurate.
- good early indicator of execution "hot-spots".
Example - Profview

Flowview
- monitors subroutine activity by logging routine entry and exit times
- shows how much time each routine takes
- number of times each routine was called, and the caller/callee relationships
cf77 -F prog.f
a.out
flowview -Luch > flow.report
- recompilation and reloading are needed
- incurs CPU overhead (may be significant).
- provides the in-line factor for each routine.
- one of the clearest indicators of program performance.
Example - Flowview

The arcane world of listing
- fpp - the dependency analysis phase
- does a significant amount of code restructuring, as well as inserting directives for autotasking and vectorization. It also produces a listing, detailing what it has done, and why it failed to do more eg cf77 -Zp -Wd"-l listfile" prog.f OR fpp -l listfile prog.f
- cft77
- produces a listing similar to that of fpp, but omitting any indication of autotasking eg cf77 -Wf"-e m" prog.f OR cft77 -e m prog.f
- xbrowse (incorporating atscope)
- a FORTRAN-specific editor and debugging aid
- atscope utility - helps the user decide if variables are of private or shared scope in a parallel region, and inserts autotasking directives for you.
- Useful for those who wish to improve on fpp's efforts at autotasking eg xbrowse &
- ftref
- generates a static call tree showing which routines call which and other cross-referencing information from a listing file (ie before execution) eg
cf77 -Wf"-esx" -c progn.f
ftref -c full -tfull progn.l > progn.xref
Jumpview
- the only way to obtain megaflops ratings on the Y-MP EL
- determines the exact timing of every executed code block within your program. eg
cf77 -Wf"-ez" -l trace prog.f
jt a.out
jumpview -Lumch > jump.report
- requires recompilation and reloading
- incurs significant CPU overhead
- Jumptracing timings are exact and reproducible, in contrast to the operating system timings in Flowtracing and probabilistic timings in Profiling.
- unlike flowview, jumpview gives times for library routines
- helps show the ratio of vector to scalar operations in each subroutine. Combined with the megaflops rating, this gives what is probably the clearest indicator of vector performance.
Example - Jumpview

Autotasking analysis - Atexpert
- atexpert
- This utility gauges the effectiveness of autotasking, by predicting the speedup in a dedicated environment of up to eight processors eg
cf77 -Zp -Wu"-p" prog.f
a.out
atexpert -L -rps -o -f atx.raw > atx.report
- Recompilation and reloading are required
- some CPU overhead
- atexpert works with multitasked codes
- strength lies in its ability to provide observations on the code ie gives suggestions as to how best the code may be improved - often vague
Example - atexpert

mtdump
- The mtdump command examines the unformatted dump of the multitasking history buffer and generates reports according to the options used.
- The only required argument is file eg mtdump filename
XPROC and XFM
- xproc (the process monitor) and xfm (the file manager) have been enhanced in keeping with the CrayLook
- xproc monitors both interactive processes and batch jobs submitted through the UNICOS Network Queuing System (NQS)
- xfm provides a graphic display of UNICOS files and a simplified method of file management:
- helps you move, copy, and delete files
- can edit files and perform other file functions without having to learn UNICOS command-line options.
Optimising Fortran Programs - Code Optimization
Techniques to improve performance of inefficient code.
- I/O and memory enhancements
- assign command may be used a number of ways eg to alter the file buffer size for read and write operations.
- The default file buffer sizes, in 512-word blocks, are as follows:

- Using the -b flag, it is possible to alter buffer sizes to suit your particular needs: long sequential reads or writes benefit from a large buffer, while if file accesses are short and effectively randomly distributed within the program the buffer size should be reduced
- The -n flag may be used to set aside system file space for your file, allowing the user to choose the length of stride
- Improvement by directive
- improve on what cf77 has done by default, without altering your source code (as directives appear as comments to other compilers we consider code with directives added to be unaltered)
- User-supplied directives may be to fpp, fmp or cft77, and are as follows (starting in column 1)
CFPP$ DIRECTIVE [SCOPE]
CMIC$ DIRECTIVE
CDIR$ DIRECTIVE
- Unconditional vectorization
- By default, fpp will not vectorize a loop with a potential data dependency. The user may know, however, that a data dependency does not occur in practice. For example:
DO 10 I = 1, 10
A(I) = A(I+J)
10 CONTINUE
- If we happen to know that J > 10 at runtime, there is no dependency and we may insert the `CFPP$ NODEPCHK' directive; the loop will then vectorize. The cft77 directive `CDIR$ IVDEP' has the same effect.
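Using the directive names from the source text, the directive is placed immediately before the loop it applies to. A sketch:

```fortran
C     fpp form: suppress the dependency check for the next loop
CFPP$ NODEPCHK
      DO 10 I = 1, 10
        A(I) = A(I+J)
 10   CONTINUE

C     cft77 form: ignore apparent vector dependencies
CDIR$ IVDEP
      DO 20 I = 1, 10
        A(I) = A(I+J)
 20   CONTINUE
```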
- Code inlining
- write a routine explicitly into the loop body
- `CFPP$ EXPAND subroutine' directive inlines all occurrences of subroutine into current routine
- `NEXPAND' directive performs nested expansion, inlining routines called in inlined routines
- `AUTOEXPAND' directive enables automatic routine inlining, with loop, routine or file scope.
- `SEARCH (files)' directive is used to tell cf77 where to look for a routine to inline, if it is not in the same file.
- Loop unrolling
- writing the body out explicitly n times, where n is the number of iterations, or an internal limit, whichever is smaller.
- `CFPP$ UNROLL (n)' directive forces these loops to unroll.
- Short vector loops
- In cases where you know that a vectorizable loop has an iteration count of less than 64, it is advisable to use the `CDIR$ SHORTLOOP' directive, as this cuts down on overhead.
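A sketch of the directive in place (array names hypothetical); the assertion lets the compiler drop the runtime trip-count test, since the whole loop fits in one vector register load:

```fortran
C     Trip count is known never to exceed the vector length,
C     so loop-count test overhead can be eliminated
CDIR$ SHORTLOOP
      DO 10 I = 1, 48
        A(I) = B(I) + C(I)
 10   CONTINUE
```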
- Autotasking and microtasking directives
- The `CMIC$' directives provide a mechanism whereby the user can implement a parallel region where fpp has not recognised the potential for tasking.
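As a sketch only (the exact clause spellings should be checked against the CF77 parallel processing manuals), a loop the user knows to be independent might be marked for autotasking like this:

```fortran
C     Sketch: distribute the iterations across available CPUs.
C     A, B and N are shared between tasks; I is private to each.
CMIC$ DO ALL SHARED(A, B, N) PRIVATE(I)
      DO 10 I = 1, N
        A(I) = A(I) + B(I)
 10   CONTINUE
```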
- Other directives
- eg allow processes such as listings or vectorization to be toggled on/off
Optimising Fortran Programs - Changing the code
- If the performance is unsatisfactory you may want to rewrite the code to some extent.
- Some guidelines to follow are:
- Look first at loops which the fpp listing "almost" vectorized ie most likely areas to make progress
- Are there loops inhibited from vectorization which could be split into vectorizable and nonvectorizable sections?
- Is it possible to dispense with any I/O statements which prevent vectorization or autotasking?
- If there are any routines which could be replaced by library routines, it is usually best to do so (again, fpp catches most level 1 and level 2 BLAS). Optimized NAG routines are available; link your code to the library as follows
- cf77 -Wl"-l /usr/local/lib/libnag.a" prog.f
TIMING CODE
- STOP - include this command prior to the program's END to produce statistics thus:
- Id - identification number
- CP - actual CPU time used by the program
- WALL CLOCK TIME - the elapsed real time
- % of xCPUS - percentage of total CPU time used
- SECOND - returns the elapsed CPU time (a real number in seconds) since the start of a program, including time accumulated by all processes in a multitasking program. Eg
      REAL BEFORE, AFTER, CPUTIME
      BEFORE = SECOND( )
      CALL DOWORK( )
      AFTER = SECOND( )
      CPUTIME = AFTER - BEFORE
All documents are the responsibility of, and copyright, © their authors and do not represent the views of The Parallel Computer Centre, nor of The Queen's University of Belfast.
Maintained by Alan Rea, email A.Rea@qub.ac.uk
Generated with CERN WebMaker