As part of my Ph.D. research, I am developing numerical models of atmosphere and ocean circulation. These numerically solve systems of PDEs on ~10^6 grid points over ~10^4 time steps, so a typical model simulation takes hours to a few days to complete when run with MPI on dozens of CPUs. Naturally, improving model efficiency as much as possible is important, while making sure the results stay bit-for-bit identical.
While I feel quite comfortable with my Fortran programming and know a fair number of tricks for making code more efficient, I suspect there is still room to improve, and tricks I am not aware of.
Currently, I use as few divisions as possible, avoid literal constants in actual computations (I was taught from very early on to assign them to named variables, e.g. half = 0.5, and use those instead), keep transcendental function calls to a minimum, etc.
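To make the first two habits concrete, here is a minimal sketch (the centred-difference loop is just a made-up example, not code from my model):

program recip_demo
  implicit none
  integer, parameter :: n = 1000
  real, parameter :: half = 0.5              ! named constant instead of a bare literal
  real :: u(n), dudx(n-2)
  real :: dx, rdx
  integer :: i
  call random_number(u)
  dx  = 0.01
  rdx = 1.0/dx                               ! one division, outside the loop...
  do i = 2, n-1
     dudx(i-1) = half*(u(i+1) - u(i-1))*rdx  ! ...then only multiplications inside it
  end do
  print *, dudx(1)
end program recip_demo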
What other performance-sensitive factors are there? At the moment, I am wondering about a few:
1) Does the order of mathematical operations matter? For example, if I have:
a=1E-7 ; b=2E4 ; c=3E13
d=a*b*c
would d evaluate with different efficiency depending on the order of multiplication? Nowadays this must be compiler-specific, but is there a straight answer? I notice d getting a (slightly) different value depending on the order (precision limit), but does the order affect efficiency or not?
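For reference, this is how I check the rounding difference; both groupings cost the same two multiplications, so my question is really whether the compiler cares about the order (a toy example, obviously):

program order_demo
  implicit none
  real :: a, b, c, d1, d2
  a = 1e-7 ; b = 2e4 ; c = 3e13
  d1 = (a*b)*c          ! left-to-right grouping
  d2 = a*(b*c)          ! same two multiplies, different rounding path
  print *, d1, d2, d1 == d2
end program order_demo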
2) Passing lots (e.g. dozens) of arrays as arguments to a subroutine versus accessing those arrays from a module inside the subroutine?
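To make the comparison concrete, here is a stripped-down sketch of the two styles I mean (only two arrays instead of dozens; step_args and step_module are placeholder names):

module fields
  implicit none
  real, allocatable :: u(:,:), v(:,:)
end module fields

subroutine step_args(nx, ny, u, v)     ! argument-passing style
  implicit none
  integer, intent(in)    :: nx, ny
  real,    intent(inout) :: u(nx,ny)
  real,    intent(in)    :: v(nx,ny)
  u = u + v
end subroutine step_args

subroutine step_module()               ! module-access style
  use fields, only: u, v
  implicit none
  u = u + v
end subroutine step_module

program arg_vs_module
  use fields, only: u, v
  implicit none
  allocate(u(100,100), v(100,100))
  u = 0.0
  v = 1.0
  call step_args(100, 100, u, v)
  call step_module()
  print *, u(1,1)
end program arg_vs_module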
3) Fortran 95 constructs (FORALL and WHERE) versus DO and IF? I know these mattered back in the 90s when code vectorization was a big thing, but is there any difference now that modern compilers can vectorize explicit DO loops? (I am using the PGI, Intel, and IBM compilers in my work.)
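For concreteness, a toy example of the two forms doing the same thing (clipping negative values; the names are made up):

program where_vs_do
  implicit none
  integer, parameter :: n = 10000
  real :: h(n), hclip(n)
  integer :: i
  call random_number(h)
  h = h - 0.5

  where (h < 0.0)            ! Fortran 95 array syntax
     hclip = 0.0
  elsewhere
     hclip = h
  end where

  do i = 1, n                ! the equivalent explicit loop
     if (h(i) < 0.0) then
        hclip(i) = 0.0
     else
        hclip(i) = h(i)
     end if
  end do

  print *, sum(hclip)
end program where_vs_do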
4) Raising a number to an integer power versus multiplication? E.g.:
b=a**4
or
b=a*a*a*a
I have been taught to always use the latter where possible. Does this affect efficiency and/or precision? (Probably compiler-dependent as well.)
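For illustration, the variants side by side, including the repeated-squaring form, which needs two multiplications instead of three (whether a compiler applies this to a**4 on its own is exactly what I am unsure about):

program pow_demo
  implicit none
  real :: a, a2, b1, b2, b3
  a  = 1.7
  b1 = a**4            ! integer power
  b2 = a*a*a*a         ! three explicit multiplies
  a2 = a*a
  b3 = a2*a2           ! repeated squaring: two multiplies
  print *, b1, b2, b3
end program pow_demo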
Please discuss and/or add any tricks and tips you know about for improving Fortran code efficiency. What else is out there? If you know anything specific about what the compilers above do related to this question, please include that as well.
Added: Note that I do not have any bottlenecks or performance issues per se. I am asking whether there are any general rules for optimizing the code at the level of individual operations.
Thanks!
You've got a priori ideas about what to do, and some of them might actually help, but the biggest payoff is in a posteriori analysis.
(Added: In other words, getting a*b*c into a different order might save a couple of cycles (which I doubt), while at the same time you don't know you're not getting blind-sided by something spending 1000 cycles for no good reason.)
No matter how carefully you code it, there will be opportunities for speedup that you didn't foresee. Here's how I find them. (Some people consider this method controversial).
It's best to start with optimization flags OFF when you do this, so the code isn't all scrambled. Later you can turn them on and let the compiler do its thing.
Get it running under a debugger with enough of a workload that it runs for a reasonable length of time. While it's running, manually interrupt it, and take a good hard look at what it's doing and why. Do this several times, like 10, so you don't draw erroneous conclusions about what it's spending time on.
In my experience, what you find this way is rarely what you would have guessed in advance. If you do this entire operation two or three times, you will have removed the stupid stuff that finds its way into any software when it's first written. After that, you can turn on the optimization, parallelism, or whatever, and be confident no time is being spent on silly stuff.