Subroutine expand_temp_profile is very relaxed when it
comes to checking the transform length, especially against the number
of nodes used by your HPF program, whereas subroutine
get_diffusion_matrix goes about it quite meticulously. This is overkill
in this program, though. Observe that
expand_temp_profile is called before
get_diffusion_matrix. Consequently, if the transform length is
not a multiple of the number of nodes, we are going to bomb out thanks to
expand_temp_profile well before we get to all
that checking in
get_diffusion_matrix. But, at least,
this subroutine shows how to do that, if you really have to.
get_diffusion_matrix calls subroutine
factor_nodes on entry. Subroutine
factor_nodes, in turn, checks the number of
processors used by the program, and then tests whether that number can be
factored into the primes 2, 3, 5, 7, and 11, the only prime factors accepted by
fft. On exit, subroutine
factor_nodes returns an array describing this factorization.
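As a rough illustration of what such a factorization has to do, here is a Python sketch; the return format (a dict of prime exponents) is my assumption, since the actual HPF array is not reproduced here:

```python
def factor_nodes(num_nodes):
    """Factor the number of nodes into powers of 2, 3, 5, 7 and 11,
    the only prime factors the parallel fft accepts.
    Returns a dict mapping each admissible prime to its exponent;
    raises ValueError if any other prime divides num_nodes."""
    factors = {}
    n = num_nodes
    for p in (2, 3, 5, 7, 11):
        count = 0
        while n % p == 0:
            n //= p
            count += 1
        factors[p] = count
    if n != 1:
        raise ValueError(
            f"{num_nodes} is not of the form 2^h 3^i 5^j 7^k 11^m")
    return factors
```

For example, factor_nodes(12) returns {2: 2, 3: 1, 5: 0, 7: 0, 11: 0}, whereas factor_nodes(13) raises an error.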
After factor_nodes returns, we use its output to construct the transform length.
Consider some examples to see how this works.
First assume that
np = 2^h. Then
f1 = np / 2^h = 1,
minpower2 will adjust the transform length to the
nearest power of two greater than or equal to its argument,
and the multiplication by f1 = 1 won't change anything. In short, in this
context all that we really do is
nx = minpower2(4*(dif_nx + 1))
and this is much the same as what we have already seen in
expand_temp_profile. Furthermore, we still don't check if nx is a multiple of the number of nodes, sic!
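minpower2 itself is simple enough that a Python sketch of its presumed behaviour fits in a few lines (a reimplementation inferred from the usage above, not the original routine):

```python
def minpower2(n):
    """Smallest power of two greater than or equal to n
    (presumed behaviour of the original minpower2)."""
    p = 1
    while p < n:
        p *= 2
    return p

dif_nx = 7                        # illustrative number of harmonics
nx = minpower2(4 * (dif_nx + 1))  # the computation discussed above
```

Nothing in this computation makes nx a multiple of the number of nodes, which is precisely the gap noted above.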
Why is there
4*(dif_nx + 1) instead of just
2*(dif_nx + 1),
as was sufficient in
expand_temp_profile?
The reason for this is that when we calculate matrix
F we will need terms with indices such as k+ = k + k' and k- = k - k'.
In other words, if k and k' run from 1 through 7, then k+ runs from
2 through 14. In principle k- would run from -6 through 6, but
we have already shown how the negative values of k are dealt with in our case,
and ditto for j. In short, this time we need to have Fourier coefficients
from 1 through 2 nx and from 1 through 2 ny,
and then, in order to stay away from negative frequencies, we need to double the number
of Fourier coefficients once more.
But since Fourier coefficients are cheap, and you only need to get them once, you can just as well avail yourself of more than you really need.
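The index bookkeeping in this argument is easy to verify numerically. A small Python check, using dif_nx = 7 so that k and k' run from 1 through 7 as in the example:

```python
dif_nx = 7
ks = range(1, dif_nx + 1)          # k and k' run from 1 through 7
sums = {k + kp for k in ks for kp in ks}
diffs = {k - kp for k in ks for kp in ks}
print(min(sums), max(sums))        # k+ spans 2 .. 14
print(min(diffs), max(diffs))      # k- spans -6 .. 6
# 2*(dif_nx + 1) = 16 covers all the sum indices; doubling once
# more to 4*(dif_nx + 1) = 32 leaves room to stay away from
# negative frequencies.
```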
Now assume that
np = 2^h * 3^1. Then
f1 = np / 2^h = 3 and
minpower2 will return 16. Now we multiply that 16 by 3 and get 48. Will this work for, say, 12 nodes? 12 = 2^2 * 3, 48 = 4 * 12, and 48 = 2^4 * 3, so we have satisfied both conditions, i.e., that the transform length must be a multiple of the number of nodes and that it must be of the form 2^h 3^i 5^j 7^k 11^m too.
But observe that the total number of nodes still falls out of the equation. We'll be all right going for 24 and for 48 nodes, but not for 96. The subroutine doesn't check if nx and ny evaluated by this process are indeed larger than the number of processors. Admittedly, not many people can afford 96-node machines, and those who can are unlikely to run this demonstration program on all 96 nodes either. Still, this is an omission, or a bug, and, as you can see, the procedure could still be improved.
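The missing check is easy to add. Here is a hedged Python sketch of a validation that enforces the conditions discussed in this section (the function name is mine, and a real fix would of course live in the HPF code):

```python
def is_valid_transform_length(n, num_nodes):
    """True if n is a positive multiple of num_nodes (which, for
    positive n, also guarantees n >= num_nodes) and n is of the
    form 2^h 3^i 5^j 7^k 11^m."""
    if n <= 0 or n % num_nodes != 0:
        return False
    for p in (2, 3, 5, 7, 11):
        while n % p == 0:
            n //= p
    return n == 1
```

is_valid_transform_length(48, 12) holds, while is_valid_transform_length(48, 96) fails, which is exactly the 96-node case that slips through the original subroutine.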
In summary, both ways to calculate the transform length, in
expand_temp_profile and in
get_diffusion_matrix, are somewhat
botched. They demonstrate two techniques of attacking the problem,
but both have loose ends. My preference would be to go for the simpler
one, i.e., for a power of 2, and to forget about the complexities
associated with filling the gaps between the powers of 2 with
2^h 3^i 5^j 7^k 11^m, unless you really, really need it.
Forcing the number of nodes to be a power of 2 too can be done easily
on program entry. The transform length can then be adjusted simply by
taking a multiple of the number of nodes and comparing that with
the required number of harmonics. On our SP you could then run the
program on 2, 4, 8, 16, or 32 nodes.
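This simpler strategy can be sketched in a few lines of Python (the function names are mine; in the real program this adjustment would be done in HPF on entry):

```python
def is_power_of_2(n):
    # A positive integer is a power of two iff it has a single set bit.
    return n > 0 and n & (n - 1) == 0

def transform_length(num_nodes, harmonics):
    """Assuming num_nodes is a power of 2, return the smallest
    power-of-2 multiple of num_nodes that accommodates the
    required number of harmonics."""
    if not is_power_of_2(num_nodes):
        raise ValueError("number of nodes must be a power of 2")
    n = num_nodes
    while n < harmonics:
        n *= 2
    return n
```

For example, transform_length(8, 32) and transform_length(32, 20) both return 32: on 2, 4, 8, 16, or 32 nodes the result is always a power of 2 and a multiple of the node count.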
Another alternative would be to evaluate
fft locally, i.e.,
not to make arrays such as
atmp distributed. For our
problem those arrays are so small, and even a sequential fft
so fast, that we don't gain anything by parallelising this computation.