CRAY Y-MP EL Student Notes Optimizing Fortran Programs

11 Optimizing Fortran Programs - Code Optimization

Once the performance utilities have identified the inefficiencies in a program, one or more of the following techniques can be used to improve its performance.

11.1 I/O and memory enhancements

File handling is controlled with the assign command; see man assign for a full description and usage. This command may be used in a number of ways, most notably to alter the file buffer size for read and write operations. The default file buffer sizes, in 512-word blocks, are as follows:

"Default Buffer Sizes."

Using the -b flag, it is possible to alter buffer sizes to suit your particular needs. For long, sequential reads or writes, a large buffer is in order; if file accesses are short and effectively randomly distributed within the program, it is advisable to reduce the buffer size. The -n flag may be used to set aside system file space for your file, allowing you to choose the length of stride. Check the effectiveness of these changes with procview.

If procview or job accounting has highlighted inefficient memory transactions, it may be worthwhile checking the code for array accesses in strides of 2^k, where k > 5 (remember that in Fortran the left-hand index changes most rapidly). Strides that are large powers of two tend to hit the same memory banks repeatedly. If the access pattern can be altered, then do so.
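As a rough illustration of the stride problem, the sketch below sums the same array twice: once with the inner loop over the right-hand index (a stride of 64 words) and once with the loops interchanged (unit stride). The array name, its size and the loop labels are assumptions made for the example, not taken from any particular program.

```fortran
      PROGRAM STRIDE
C     Sketch only: array name and sizes are illustrative.
      INTEGER N
      PARAMETER (N = 64)
      REAL A(N,N), S1, S2
      INTEGER I, J
      DO 20 J = 1, N
         DO 10 I = 1, N
            A(I,J) = 1.0
   10    CONTINUE
   20 CONTINUE
C     Inner loop runs over the right-hand index: consecutive
C     accesses are N = 64 = 2**6 words apart in memory.
      S1 = 0.0
      DO 40 I = 1, N
         DO 30 J = 1, N
            S1 = S1 + A(I,J)
   30    CONTINUE
   40 CONTINUE
C     Loops interchanged: the inner loop now runs over the
C     left-hand index, so accesses are one word apart.
      S2 = 0.0
      DO 60 J = 1, N
         DO 50 I = 1, N
            S2 = S2 + A(I,J)
   50    CONTINUE
   60 CONTINUE
      PRINT *, S1, S2
      END
```

Both loops compute the same sum; only the order of memory accesses differs. Where the loops cannot be interchanged, padding the leading dimension (for example declaring A(N+1,N)) is another common way to avoid a power-of-two stride.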

11.2 Improvement by directive

Compiler directives and command line options are important fine-tuning instruments in the optimization procedure. It is often possible to improve on what cf77 has done by default, without altering your source code (as directives appear as comments to other compilers, we shall consider code with directives added to be unaltered).

User-supplied directives may be addressed to fpp, fmp or cft77 respectively, and take the following forms (starting in column 1):

CFPP$ DIRECTIVE [SCOPE]

CMIC$ DIRECTIVE

CDIR$ DIRECTIVE

The scope of an fpp directive may be L (loop), R (routine) or F (file). Directives beginning with CMIC@ or CDIR@ are generated by fpp and fmp, and should not be inserted by the user.

Some of the more popular and useful directives are discussed below. For further information on compiler directives, see CF77 Compiling System, Vol 1: Fortran Reference Manual; for vectorization directives, see Vol 3: Vectorization Guide; and for tasking directives, see Vol 4: Parallel Processing Guide.

11.3 Unconditional vectorization

By default, fpp will not vectorize a loop with a potential data dependency. The user may know, however, that a data dependency does not occur in practice. For example:

      DO 10 I = 1, 10
         A(I) = A(I+J)
   10 CONTINUE

If we happen to know that J will be greater than 10 at runtime, so that the elements read never overlap those written and there is no dependency, we may insert the `CFPP$ NODEPCHK' directive, and the loop will vectorize. The cft77 directive `CDIR$ IVDEP' has the same effect.
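A minimal sketch of the loop above with the cft77 directive in place might look as follows. The array size and the value assigned to J are assumptions chosen so that the ranges read and written cannot overlap; the directive line starts in column 1 and is treated as a comment by other compilers.

```fortran
      PROGRAM NODEP
C     Sketch only: array size and the value of J are assumptions
C     chosen so that the ranges read and written cannot overlap.
      REAL A(30)
      INTEGER I, J
      DO 5 I = 1, 30
         A(I) = REAL(I)
    5 CONTINUE
      J = 15
C     The directive below is a comment to other compilers; it tells
C     cft77 to ignore the apparent dependency in the next loop.
CDIR$ IVDEP
      DO 10 I = 1, 10
         A(I) = A(I+J)
   10 CONTINUE
      PRINT *, A(1), A(10)
      END
```

The directive is a promise by the programmer: if J could ever be small and negative, the loop carries a genuine recurrence and the vectorized result would be wrong.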

11.4 Code inlining

Often vectorization and/or autotasking are inhibited because of a subroutine call within a loop. In this case, we wish to write the routine explicitly into the loop body. This is called code inlining, and cf77 is kind enough to do most of the hard work for us, as follows.

The `CFPP$ EXPAND subroutine' directive inlines all occurrences of subroutine into the current routine. The default is no automatic inlining.

Similarly, the `NEXPAND' directive performs nested expansion, inlining routines called in inlined routines. Bear in mind that fpp may refuse to honour an `NEXPAND' directive if the nesting level is too deep.

The `AUTOEXPAND' directive enables automatic routine inlining, with loop, routine or file scope. Only leaves of the call tree are inlined, and then only provided the routine is smaller than some internal limit. This directive with file scope is equivalent to specifying `cf77 -Wd"-e68" ' on the command line.

The associated `SEARCH (files)' directive is used to tell cf77 where to look for a routine to inline, if it is not in the same file.
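To make the effect of inlining concrete, the sketch below shows a loop containing a call, followed by the same loop as it would look with the routine body written in by hand. The routine name AXPB and all sizes are hypothetical, and the directive is written in the form quoted in these notes; the exact argument syntax should be checked against the Vectorization Guide.

```fortran
      PROGRAM INLINE
C     Sketch only: routine name AXPB and all sizes are hypothetical.
      REAL X(100), Y(100)
      INTEGER I
      DO 5 I = 1, 100
         X(I) = REAL(I)
    5 CONTINUE
C     Directive written in the form given in these notes; check the
C     exact argument syntax against the Vectorization Guide.
CFPP$ EXPAND AXPB
      DO 10 I = 1, 100
         CALL AXPB(X(I), Y(I))
   10 CONTINUE
C     What the loop looks like once the call has been inlined:
      DO 20 I = 1, 100
         Y(I) = 2.0*X(I) + 1.0
   20 CONTINUE
      PRINT *, Y(1), Y(100)
      END

      SUBROUTINE AXPB(U, V)
      REAL U, V
      V = 2.0*U + 1.0
      RETURN
      END
```

The second loop contains no call and is therefore a straightforward candidate for vectorization.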

11.5 Loop unrolling

Loops with a low iteration count, or a small amount of work in the loop body, may be uneconomical because of overhead. Worse still, an outer loop may be prevented from vectorizing by the presence of an inner loop with only one or two iterations! We combat these effects by unrolling the loops, i.e., writing the body out explicitly n times, where n is the number of iterations, or an internal limit, whichever is smaller.

The `CFPP$ UNROLL (n)' directive forces these loops to unroll. If used with local scope (default), the optional parameter n is the number of times to unroll that particular loop. If used with routine or file scope, n is the maximum iteration count (default 3) of loops to be unrolled completely.
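The transformation that the directive automates can be sketched by hand as follows. The array names and sizes are illustrative assumptions; the first loop nest has a two-iteration inner loop, and the second shows the same work with that inner loop fully unrolled.

```fortran
      PROGRAM UNRL
C     Sketch only: array names and sizes are illustrative.
      REAL A(100,2), B(100,2)
      INTEGER I, K
      DO 6 K = 1, 2
         DO 5 I = 1, 100
            A(I,K) = REAL(I + K)
    5    CONTINUE
    6 CONTINUE
C     Original form: a two-iteration inner loop inside the long loop.
      DO 20 I = 1, 100
         DO 10 K = 1, 2
            B(I,K) = 2.0*A(I,K)
   10    CONTINUE
   20 CONTINUE
C     Unrolled form: the inner loop body is written out twice,
C     leaving a single long loop over I.
      DO 30 I = 1, 100
         B(I,1) = 2.0*A(I,1)
         B(I,2) = 2.0*A(I,2)
   30 CONTINUE
      PRINT *, B(1,1), B(100,2)
      END
```

After unrolling, the only remaining loop is the long one over I, which removes the inner-loop overhead and leaves a single loop for the vectorizer to work on.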

11.6 Short vector loops

In cases where you know that a vectorizable loop has an iteration count of less than 64, it is advisable to use the `CDIR$ SHORTLOOP' directive, as this cuts down on overhead.
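A minimal sketch of the directive in use is given below. The routine name SQR and the trip count of 32 are assumptions for the example; the point is that N is only known at runtime, so the compiler cannot deduce for itself that the loop is short.

```fortran
      PROGRAM SHORT
C     Sketch only: routine name SQR and N = 32 are assumptions.
      REAL A(32), B(32)
      INTEGER I
      DO 5 I = 1, 32
         B(I) = REAL(I)
    5 CONTINUE
      CALL SQR(32, A, B)
      PRINT *, A(1), A(32)
      END

      SUBROUTINE SQR(N, A, B)
C     N is a runtime value; the caller guarantees it is below 64.
      INTEGER N, I
      REAL A(N), B(N)
C     Directive starts in column 1 and applies to the next loop.
CDIR$ SHORTLOOP
      DO 10 I = 1, N
         A(I) = B(I)*B(I)
   10 CONTINUE
      RETURN
      END
```

As with the dependency directives, this is a promise by the programmer: the directive should only be used where the trip count really is within the limit on every call.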

11.7 Autotasking and microtasking directives

The `CMIC$' directives provide a mechanism whereby the user can implement a parallel region where fpp has not recognised the potential for tasking. This subject is too vast to be dealt with here; see CF77 Volume 4: Parallel Processing Guide for an explanation of fmp directives.

11.8 Other directives

There are many more directives than those listed here. Some allow processes such as listings or vectorization to be toggled on/off, others have quite specialised uses, and are irrelevant to those interested in achieving modest optimization in as short a time as possible. See the manual for a full list.

11.9 Changing the code

It may be that even after (or especially after) playing with the compiler options, you are still unhappy with performance, and want to rewrite the code to some extent. This will be very case-dependent, but using these guidelines, combined with a measure of cunning, might reap some reward:

Look first at loops which the fpp listing shows as "almost" vectorized - these are likely to be the areas where some progress is possible.

Are there loops inhibited from vectorization which could be split into vectorizable and nonvectorizable sections? (These are usually spotted by fpp).

Is it possible to dispense with any I/O statements which prevent vectorization or autotasking?

If there are any routines which could be replaced by library routines, it is usually best to do so (again, fpp catches most level 1 and level 2 BLAS). Optimized NAG routines are available; link your code to the library as follows:

cf77 -Wl"-l /usr/local/lib/libnag.a" prog.f

For information on specific routines, type naghelp. Remember that single precision on the Cray (64-bit words) corresponds to double precision on most other machines, so we use single precision NAG routines. Thus if the routine name is listed in naghelp as XXXXXF, use XXXXXE.

It is sometimes fruitful to use a dummy array to calculate partial results in vector mode, then use these results in a final scalar loop.
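The idea of splitting a loop into a vectorizable part and a scalar part, with a dummy array carrying the partial results between them, can be sketched as follows. The array names, the sizes and the particular recurrence are illustrative assumptions.

```fortran
      PROGRAM SPLIT
C     Sketch only: names, sizes and the recurrence are illustrative.
      INTEGER N
      PARAMETER (N = 100)
      REAL X(N), Y(N), T(N), S
      INTEGER I
      DO 5 I = 1, N
         X(I) = REAL(I)
         Y(I) = 1.0/REAL(I)
    5 CONTINUE
C     Vectorizable part: each T(I) is independent of the others.
      DO 10 I = 1, N
         T(I) = SQRT(X(I))*Y(I)
   10 CONTINUE
C     Scalar part: the recurrence on S uses the stored results.
      S = 0.0
      DO 20 I = 1, N
         S = 0.5*S + T(I)
   20 CONTINUE
      PRINT *, S
      END
```

Written as a single loop, the recurrence on S would hold back the square root and multiply as well; with the dummy array T, the expensive arithmetic runs in vector mode and only the cheap recurrence remains scalar.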


All documents are the responsibility of, and copyright, © their authors and do not represent the views of The Parallel Computer Centre, nor of The Queen's University of Belfast.
Maintained by Alan Rea, email A.Rea@qub.ac.uk
