In this section we had a close look at two broad classes of algorithms.
The first class is represented by the algorithms we implemented in order to evaluate the fit coefficients a and b and the fitted values y' = a + b x. These computations are outlined in Figure 2.10. Their characteristic feature is the use of array operations and a total lack of explicit iterations, i.e., no DO loops, even though in other languages, like C or Lisp, we would have to use many iterative constructs to perform the same computations. These operations are explicitly parallel: if we were to run them on a parallel computer, the array operations could be spread over many processors without any change to the code. But remember:
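The book's examples are in Fortran, whose array syntax expresses such operations directly. As a rough analogue, here is the same idea sketched in Python with NumPy: the variable names x, a, b and yprime are illustrative assumptions, not taken from Figure 2.10 itself.

```python
import numpy as np

# A whole-array expression, analogous to a Fortran 90 array statement:
# no explicit DO loop appears, the operation is stated for the entire
# array at once, so the elements may in principle be evaluated in
# parallel.  (An illustrative sketch, not the book's code.)
x = np.array([0.0, 1.0, 2.0, 3.0])
a, b = 0.5, 2.0

yprime = a + b * x   # one array statement, no iteration written out
```

In C we would instead have to write an explicit loop over the elements; here the iteration is implicit in the array expression.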
There is no point whatsoever in parallelising small programs or programs working on small data sets. The effort involved in parallelising such programs would never pay off.
The second class is represented by the algorithms we used to calculate incomplete gamma functions. You can see these algorithms implemented in Figures 2.12 and 2.13. In both cases the algorithms are recursive, i.e., we have to enter a DO loop within which successive terms are computed from terms evaluated in previous iterations. Such algorithms are basically sequential. They cannot be parallelised, although small optimisations may sometimes be possible. The only way to make them run fast is to have a very fast sequential processor clocked at a very high rate.
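To see why such a loop is inherently sequential, consider the standard power-series evaluation of the lower regularised incomplete gamma function P(a, x) (a sketch of the well-known technique, not the book's own Figure 2.12 code): each term is obtained from the previous one, so iteration n cannot begin before iteration n-1 has finished.

```python
import math

def gamma_p_series(a, x, eps=1e-12, itmax=200):
    """Lower regularised incomplete gamma P(a, x) by its power series.

    The recurrence term *= x / (a + n) makes every iteration depend on
    the result of the previous one -- the loop cannot be parallelised.
    """
    term = 1.0 / a          # first term of the series
    total = term
    for n in range(1, itmax):
        term *= x / (a + n)   # depends on the term just computed
        total += term
        if abs(term) < abs(total) * eps:
            break             # series has converged
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))
```

For a = 1 the function reduces to 1 - exp(-x), which gives a quick check of the recurrence.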
In the case of the algorithm shown in Figure 2.13, a superscalar processor, or an explicitly parallel instruction set CPU such as Merced, can speed things up just a little. For example, a0 in the first line of the DO loop can be evaluated at the same time as b0 in the second line. Similarly, lines 3 and 4 of the DO loop can be evaluated simultaneously too. Lines 5 through 7 must then be evaluated in sequence, after we have calculated b1 in line 4. But the last, 8th, line, n, can be evaluated at the same time as lines 5 through 7. In summary, instead of 8 sequential operations per loop iteration, we can do it in 5 parallel steps. The speedup could be of the order of 1.5 to 2 against a purely sequential processor.
There are other things you can optimise when developing algorithms of this kind. First of all, try to minimise data movement. For example, if you have an auxiliary variable that memorises an old value of some quantity for use in the next iteration, make that variable reside on the CPU itself, in a register. A good compiler should make such optimisations automatically; sometimes you may have to use special compiler directives.
But a more important optimisation lies in the choice of the algorithm itself: use an algorithm that converges quickly, so that you don't have to run many iterations. Continued fraction expansions, discussed in Section 2.2.3, are often very effective in this respect. But you must watch their regions of validity: make sure that you don't divide by zero, and that you don't even get close to a division by zero. You can usually switch to a safer procedure in such a region, as we have done in the function goodness (cf. Figure 2.11).
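The standard way to evaluate such a continued fraction while guarding against vanishing denominators is the modified Lentz method; here is a sketch of it for the upper regularised incomplete gamma function Q(a, x), valid for x > a + 1. This shows the generic technique, not the book's goodness function; the FPMIN floor is precisely the kind of safeguard against dividing by (or near) zero that the text warns about.

```python
import math

def gamma_q_cf(a, x, eps=1e-12, itmax=200):
    """Upper regularised incomplete gamma Q(a, x) for x > a + 1,
    via its continued fraction, evaluated with modified Lentz.

    Whenever an intermediate denominator drifts towards zero it is
    replaced by the tiny non-zero constant FPMIN, so the iteration
    never divides by, or close to, zero.
    """
    FPMIN = 1e-300
    b = x + 1.0 - a
    c = 1.0 / FPMIN
    d = 1.0 / b
    h = d
    for i in range(1, itmax):
        an = -i * (i - a)         # numerator of the i-th partial fraction
        b += 2.0
        d = an * d + b
        if abs(d) < FPMIN:        # guard: denominator too close to zero
            d = FPMIN
        c = b + an / c
        if abs(c) < FPMIN:        # same guard for the other recurrence
            c = FPMIN
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break                 # continued fraction has converged
    return math.exp(-x + a * math.log(x) - math.lgamma(a)) * h
```

The convergence test on delta typically succeeds after only a handful of iterations when x is well inside the region of validity, which is exactly why the continued fraction beats the power series there.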