The Queen's University of Belfast

Parallel Computer Centre
Performance Utilities
`An Overview of the cf77 compiling system'

- FPP dependence analyser
- analyses and transforms source code to maximise use of hardware eg enhances vectorization, automatically detects and exploits parallelism
- FMP autotasking translation phase
- translates Autotasking directives for cft77
- CFT77 actual compiler
- compiles Fortran source code into relocatable object code
- SEGLDR
- converts relocatable object code into an executable binary file
Compiler Commands
syntax: cf77 [options] sourcefile.f
options - specify the invocation of various components
cf77 -Z invokes the preprocessor and the mid processor in the following ways:
-Zp - invokes both fpp and fmp
-Zc - bypasses both fpp and fmp (default option)
-Zv - all phases except fmp
-Zu - all phases except fpp
Example sequence of events invoked by the following command:
cf77 -Zp prog.f
fpp prog.f > prog.m
fmp prog.m > prog.j
cft77 prog.j
rm prog.m prog.j
segldr prog.o
rm prog.o
Whole system cf77 command
cf77 -Zp -Wd"fpp options" -Wu"fmp options" -Wf"cft77 options" prog.f
cf77 -Zp -o prog compiling system options
-Wd"-ei" dependency analyser options
-Wu"-x" translator options
-Wf"-em" compiler options
-Wl"-i dirs" loader options
file.f file1.f input filenames
Files within the compiling system
*.f = source
*.l = listing
*.s = CAL ie Cray Assembly Language
*.o = compiled code
*.m = code containing Autotasking/microtasking directives
*.j = Fortran code previously processed by fmp
*.a = library files
*.F = code to be processed by Generic preprocessor GPP
Compiler features
- INLINING - Inline code expansion
- Explicit or automatic.
- Subprogram is incorporated (within the resulting binary program) into the calling program unit where the call occurred.
- Eliminates the calling overhead and allows vectorization of loops which would otherwise be prevented from vectorizing due to the external call.
- A problem with inlining is that it may cause an unacceptable increase in the size of the program.
- CROSS-COMPILING
- This is compiling a program on one system to execute on another.
- The target system is specified by cpu and hdw arguments either on the compiler line (using -C) or directly to the operating system (using the TARGET environment variable).
Compiler directives
- Compiler directives within source code
- specify actions to be performed by the compiler - they are not Fortran code
- Categories of compiler directives
- vectorization control, scalar optimization control, listable output control, localised use of features specified by command line options and storage specifications.
- FPP and FMP generated directives
- these begin with CDIR@ and are used internally by the compiler, whereas user-supplied directives begin with CDIR$.
- Directives in the source program
- apply only to the program unit in which they appear
- directives used on the command line apply to the entire compilation. Generally the CDIR$ directives for features override the command-line option for the same feature.
Loop unwinding and unrolling
- Both of these optimization techniques are performed automatically by the compiler and can reduce loop overhead for eligible loops.
- Unwinding makes several copies of the body of a loop, resulting in straight-line code which is then vectorizable.
- Unrolling creates a new version of the loop at the scheduling level, not in the source, and does not remove the original. The new copy consists of n copies of the loop's computations, ie n iterations per pass. This reduces loop overhead and improves scheduling and register assignment in the new unrolled loop.
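As an illustration, here is a hand-written sketch of unrolling by a factor of four, expressed in source terms (the compiler does this at the scheduling level; the array names are hypothetical, and a real compiler adds cleanup code for trip counts not divisible by four):

```fortran
C     Original loop: one add per iteration, plus loop overhead
      DO 10 I = 1, N
        A(I) = A(I) + B(I)
 10   CONTINUE

C     Unrolled by 4: four adds per pass, one quarter of the
C     loop overhead (sketch assumes N is a multiple of 4)
      DO 20 I = 1, N, 4
        A(I)   = A(I)   + B(I)
        A(I+1) = A(I+1) + B(I+1)
        A(I+2) = A(I+2) + B(I+2)
        A(I+3) = A(I+3) + B(I+3)
 20   CONTINUE
```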
Optimization Strategies
- Preliminary considerations
- is optimization worthwhile/needed
- does the program use significant resources and which ones
- is the program destined for frequent long-term use
- how can one get the maximum return (execution speedup) on investment (your time)
- learning experience to apply to development phase of future programming
- Step 0 Define a baseline
- gather execution performance statistics for properly executing code; the job accounting and procstat utilities give information which should be retained and compared with the results obtained after optimization efforts
- maintain separately a working version of the program
- do not begin optimization until properly executing code has been obtained
- operate on a computationally short model which is representative of the whole
- Step 1 Determine resource target for optimization
- use job accounting, ja, and procstat
- determine if the program is I/O bound
- determine if the program is vectorized
- determine if the program efficiently utilises memory
- compare loopmark listings - run preprocessor to determine additional speedup eg cf77 -Zv -Wf"-em" *.f
- verify the results produced by fpp
- Step 2 Target subroutines for optimization
- loopmark, prof, perftrace
- check loopmark for vectorization information - as it is often the single most effective method of reducing overall program execution time
- Step 3 Determine target loops for vectorization
- run prof
- look for high percentage loops or regions
- look for high percentage regions which are not included in a vectorized inner loop
- look for high percentage regions which are vectorized but include complex index calculations and/or divisions
- look for high percentage regions which are vectorized, but include IF statements, and/or GOTO branches
- Step 4 Recall vector inhibitors
- I/O statements
- CALL, RETURN, STOP, or PAUSE statements
- function references
- branching statements (IF, GOTO)
- data dependencies and recurrences
- ambiguous subscript references
- long loops - could use the aggress option of -Wf but must verify results eg cf77 -Wf"-o aggress" *.f
- Step 5 Other Optimization Techniques
- unroll small loops
- inline often called subroutines
- promote scalars to vectors
- use parameter statements to define constants especially those which define loop lengths
- switch loop dimensions when appropriate
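For example, switching loop dimensions can turn stride-N memory references into stride-1 references, because Fortran stores arrays in column order. A sketch with hypothetical names:

```fortran
C     Inner loop runs over the SECOND subscript, so successive
C     references are N words apart (stride N) -- poor for memory
      DO 40 I = 1, N
        DO 30 J = 1, N
          A(I,J) = S * A(I,J)
 30     CONTINUE
 40   CONTINUE

C     Dimensions switched: inner loop runs over the FIRST
C     subscript, giving contiguous (stride-1) vector references
      DO 60 J = 1, N
        DO 50 I = 1, N
          A(I,J) = S * A(I,J)
 50     CONTINUE
 60   CONTINUE
```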
- Step 6 Other Information
- regularly run the performance tools since an optimized routine can cause another to become important
- perform optimization comparisons on a short run
Optimising Fortran Programs - Code Analysis

Types of optimization tools
- Static Analysis Tools - give compile time information
- cross reference listings (cf77 -Wf"-ex" prog.f creates the prog.l file)
- program listings and utilities - Loopmark listing and ftref
- Dynamic Analysis Tools - give run-time information
- ja - job accounting
- flowtrace - subroutine level analysis of the program giving the dynamic calling tree
- prof and profview - use a fixed time interval for sampling so resulting statistics involve probability
- procstat and procrpt
Optimization Techniques
- Job Accounting
- shows what happens during code execution.
- Incurs no CPU overhead. Gives useful information such as a timestamp of the beginning and end of runtime, total elapsed time, CPU time used, and time spent in I/O operations and memory accesses.
- If NCPUS is greater than 1 (default 3 or 4) and autotasking is turned on (-Zp flag to cf77), ja reports on the tasking breakdown, showing how many CPUs were used and for how long.
- Use the following:
ja
a.out
ja -clst > acctfile
- cl = command summary, s = summary, t = terminate
- Job Accounting - on the basis of this information decide which area of the program to concentrate on first.
- If I/O or memory activity is high, the next step should be to investigate this further with procview.
- If the code is CPU-bound, it's probably best to move on to profview and flowview, to further examine CPU activity.
Example - Job Accounting
Job Accounting - Summary Report
===============================
Job Accounting File Name : /tmp/nqs.+++++03av/.jacct4001
Operating System : sn5227 sn5227 7.0.4.2 roo.5 CRAY Y-MP
User Name (ID) : ccg0002 (1474)
Group Name (ID) : cc (901)
Account Name (ID) : centre (13)
Job Name (ID) : scrip (4001)
Report Starts : 10/11/94 13:40:11
Report Ends : 10/11/94 13:40:18
Elapsed Time : 7 Seconds
User CPU Time : 12.4072 [ 12.4067] Seconds
Multitasking Breakdown
(Concurrent CPUs * Connect seconds = CPU seconds)
--------------- --------------- -----------
1 * 0.1625 = 0.1625
2 * 1.9370 = 3.8740
3 * 2.4148 = 7.2445
(Concurrent CPUs * Connect seconds = CPU seconds)
(Avg.) (total) (total)
-------------------------------------------
2.50 * 4.5143 = 11.2809
System CPU Time : 0.7885 Seconds
I/O Wait Time (Locked) : 1.1013 Seconds
I/O Wait Time (Unlocked) : 0.2744 Seconds
CPU Time Memory Integral : 0.8239 Mword-seconds
SDS Time Memory Integral : 0.0000 Mword-seconds
I/O Wait Time Memory Integral : 0.1312 Mword-seconds
Data Transferred : 0.2458 MWords
Maximum memory used : 0.1406 MWords
Logical I/O Requests : 122
Physical I/O Requests : 97
Number of Commands : 2
Billing Units : 00
Optimization Techniques
- Procview
- produces statistics about I/O activity, process activity and memory activity for programs that use the standard UNICOS libraries.
- Use the following commands:
setenv NCPUS 1
procstat -R proc.raw a.out
procview -L -Sn proc.raw > proc.report
- -R = writes rawfile for procview to create reports, -L = prevents execution of interactive X Windows, -Sn = generates one or more system report files sorted by name
- autotasking is not allowed (NCPUS=1).
- Recompilation and reloading are not necessary
- There is very little CPU overhead
- procview report shows,
- how many bytes were read and written in each file
- the rate at which they were processed.
- With this knowledge, we can use the assign statement (explained later) to improve the efficiency of I/O activity.
Example - Procstat
PROCSTAT FILE REPORT
Showing All User Files That Moved Data
(Sorted by File Name)
(Process ID is: 94616)
I/O Max File Bytes Avg I/O Rate
Filename Mode Size Processed (Megabytes/Sec)
-------------------------------------------------------
stderr WO 0 37 0.012
stdout WO 0 2649 0.012
===============================================
Total Files = 2 0 2786 0.012
Profview
- A statistical analysis of program execution. The program address register is sampled at regular intervals, and a system routine records the address of the instruction currently being executed.
cf77 -Wf"-ez" -l prof prog.f
setenv PROF_WPB 1
a.out
prof -x a.out > prof.raw
profview prof.raw OR
profview -LmhDc prof.raw > prof.report
- requires recompilation and reloading, and uses extra memory. CPU overhead is negligible.
- reliant on the laws of probability and may be inaccurate.
- good early indicator of execution "hot-spots".
Example - Profview

Flowview
- monitors subroutine activity by logging routine entry and exit times
- shows how much time each routine takes
- number of times each routine was called, and the caller/callee relationships
cf77 -F prog.f
a.out
flowview -Luch > flow.report
- recompilation and reloading are needed
- incurs CPU overhead (may be significant).
- provides the in-line factor for each routine.
- one of the clearest indicators of program performance.
Example - Flowview

The arcane world of listing
- fpp - the dependency analysis phase
- does a significant amount of code restructuring, as well as inserting directives for autotasking and vectorization. It also produces a listing, detailing what it has done, and why it failed to do more eg cf77 -Zp -Wd"-l listfile" prog.f OR fpp -l listfile prog.f
- cft77
- produces a listing similar to that of fpp, but omitting any indication of autotasking eg cf77 -Wf"-e m" prog.f OR cft77 -e m prog.f
- xbrowse (incorporating atscope)
- a FORTRAN-specific editor and debugging aid
- atscope utility - helps the user decide if variables are of private or shared scope in a parallel region, and inserts autotasking directives for you.
- Useful for those who wish to improve on fpp's efforts at autotasking eg xbrowse &
- ftref
- generates a static call tree showing which routines call which and other cross-referencing information from a listing file (ie before execution) eg
cf77 -Wf"-esx" -c progn.f
ftref -c full -tfull progn.l > progn.xref
Jumpview
- the only way to obtain megaflops ratings on the Y-MP EL
- determines the exact timing of every executed code block within your program. eg
cf77 -Wf"-ez" -l trace prog.f
jt a.out
jumpview -Lumch > jump.report
- requires recompilation and reloading
- incurs significant CPU overhead
- Jumptracing timings are exact and reproducible, in contrast to the operating system timings in Flowtracing and probabilistic timings in Profiling.
- unlike flowview, jumpview gives times for library routines
- helps show the ratio of vector to scalar operations in each subroutine. Combined with the megaflops rating, this gives what is probably the clearest indicator of vector performance.
Example - Jumpview

Autotasking analysis - Atexpert
- atexpert
- This utility gauges the effectiveness of autotasking, by predicting the speedup in a dedicated environment of up to eight processors eg
cf77 -Zp -Wu"-p" prog.f
a.out
atexpert -L -rps -o -f atx.raw > atx.report
- Recompilation and reloading are required
- some CPU overhead
- atexpert works with multitasked codes
- strength lies in its ability to provide observations on the code ie gives suggestions as to how best the code may be improved - often vague
Example - atexpert

mtdump
- The mtdump command examines the unformatted dump of the multitasking history buffer and generates reports according to the options used.
- The only required argument is file eg mtdump filename
XPROC and XFM
- xproc (the process monitor) and xfm (the file manager) have been enhanced in keeping with the CrayLook
- xproc monitors both interactive processes and batch jobs submitted through the UNICOS Network Queuing System (NQS)
- xfm provides a graphic display of UNICOS files and a simplified method of file management:
- helps you move, copy, and delete files
- can edit files and perform other file functions without having to learn UNICOS command-line options.
Optimising Fortran Programs - Code Optimization
Techniques to improve performance of inefficient code.
- I/O and memory enhancements
- assign command may be used a number of ways eg to alter the file buffer size for read and write operations.
- The default file buffer sizes, in 512-word blocks, are as follows:

- Using the -b flag, it is possible to alter buffer sizes to suit your particular needs: long sequential reads or writes benefit from a large buffer, while if file accesses are short and effectively randomly distributed within the program the buffer size should be reduced
- The -n flag may be used to set aside system file space for your file, allowing the user to choose the length of stride
- Improvement by directive
- improve on what cf77 has done by default, without altering your source code (as directives appear as comments to other compilers we consider code with directives added to be unaltered)
- User-supplied directives may be to fpp, fmp or cft77, and are as follows (starting in column 1)
CFPP$ DIRECTIVE [SCOPE]
CMIC$ DIRECTIVE
CDIR$ DIRECTIVE
- Unconditional vectorization
- By default, fpp will not vectorize a loop with a potential data dependency. The user may know, however, that a data dependency does not occur in practice. For example:
DO 10 I = 1, 10
A(I) = A(I+J)
10 CONTINUE
- If we happen to know that J > 10 at runtime, there is no dependency and we may insert the `CFPP$ NODEPCHK' directive; the loop will then vectorize. The cft77 directive `CDIR$ IVDEP' has the same effect.
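Using the directive names from the source text, the directive is placed immediately before the loop it applies to. A sketch:

```fortran
C     fpp form: suppress the dependency check for the next loop
CFPP$ NODEPCHK
      DO 10 I = 1, 10
        A(I) = A(I+J)
 10   CONTINUE

C     cft77 form: ignore apparent vector dependencies
CDIR$ IVDEP
      DO 20 I = 1, 10
        A(I) = A(I+J)
 20   CONTINUE
```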
- Code inlining
- write a routine explicitly into the loop body
- `CFPP$ EXPAND subroutine' directive inlines all occurrences of subroutine into current routine
- `NEXPAND' directive performs nested expansion, inlining routines called in inlined routines
- `AUTOEXPAND' directive enables automatic routine inlining, with loop, routine or file scope.
- `SEARCH (files)' directive is used to tell cf77 where to look for a routine to inline, if it is not in the same file.
- Loop unrolling
- writing the body out explicitly n times, where n is the number of iterations, or an internal limit, whichever is smaller.
- `CFPP$ UNROLL (n)' directive forces these loops to unroll.
- Short vector loops
- In cases where you know that a vectorizable loop has an iteration count of less than 64, it is advisable to use the `CDIR$ SHORTLOOP' directive, as this cuts down on overhead.
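A sketch of the directive in place (array names hypothetical); the assertion lets the compiler drop the runtime trip-count test, since the whole loop fits in one vector register load:

```fortran
C     Trip count is known never to exceed the vector length,
C     so loop-count test overhead can be eliminated
CDIR$ SHORTLOOP
      DO 10 I = 1, 48
        A(I) = B(I) + C(I)
 10   CONTINUE
```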
- Autotasking and microtasking directives
- The `CMIC$' directives provide a mechanism whereby the user can implement a parallel region where fpp has not recognised the potential for tasking.
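As a sketch only (the exact clause spellings should be checked against the CF77 parallel processing manuals), a loop the user knows to be independent might be marked for autotasking like this:

```fortran
C     Sketch: distribute the iterations across available CPUs.
C     A, B and N are shared between tasks; I is private to each.
CMIC$ DO ALL SHARED(A, B, N) PRIVATE(I)
      DO 10 I = 1, N
        A(I) = A(I) + B(I)
 10   CONTINUE
```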
- Other directives
- eg allow processes such as listings or vectorization to be toggled on/off
Optimising Fortran Programs - Changing the code
- If the performance is unsatisfactory you may want to rewrite the code to some extent.
- Some guidelines to follow are:
- Look first at loops which the fpp listing "almost" vectorized ie most likely areas to make progress
- Are there loops inhibited from vectorization which could be split into vectorizable and nonvectorizable sections?
- Is it possible to dispense with any I/O statements which prevent vectorization or autotasking?
- If there are any routines which could be replaced by library routines, it is usually best to do so (again, fpp catches most level 1 and level 2 BLAS). Optimized NAG routines are available; link your code to the library as follows
- cf77 -Wl"-l /usr/local/lib/libnag.a" prog.f
TIMING CODE
- STOP - include this command prior to the program's END to produce statistics thus:
- Id - identification number
- CP - actual CPU time used by the program
- WALL CLOCK TIME - the elapsed real time
- % of xCPUS - percentage of total CPU time used
- SECOND - returns the elapsed CPU time (a real number in seconds) since the start of a program, including time accumulated by all processes in a multitasking program. Eg
      REAL BEFORE, AFTER, CPUTIME
      BEFORE = SECOND( )
      CALL DOWORK( )
      AFTER = SECOND( )
      CPUTIME = AFTER - BEFORE
All documents are the responsibility of, and copyright, © their authors and do not represent the views of The Parallel Computer Centre, nor of The Queen's University of Belfast.
Maintained by Alan Rea, email A.Rea@qub.ac.uk
Generated with CERN WebMaker