Restart the same program,
rtrace_bug, but without
the tracefile option this time:
gustav@sp20:../MPI_hangs 12:22:06 !518 $ rtrace_bug -procs 4 -labelio yes
0:Control #0: No. of nodes used is 4
0:Control: expect to receive 2500 messages
1:Compute #1: checking in
3:Compute #3: checking in
2:Compute #2: checking in
2:Compute #2: done sending.
2:Task 2 waiting to complete.
3:Compute #3: done sending.
3:Task 3 waiting to complete.
1:Compute #1: done sending.
1:Task 1 waiting to complete.

When the program hangs, in another window type:
<22:06 !504 $ ps -u gustav | grep poe | grep -v grep
43098 34200 pts/1  0:00 poe
gustav@sp20:../MPI_hangs 13:09:01 !505 $

This gives us the POE process ID number, which in this case is 34200 (43098 is my UID number).
Now, in the same window, attach the pedb debugger
to the POE process:
gustav@sp20:../MPI_hangs 13:18:50 !507 $ pedb -a 34200
pedb Version 2, Release 3 -- Oct 13 1998 21:56:50
Warning: Cannot convert string "Rom10.500" to type FontStruct

A window will pop up listing all four tasks and their PID numbers on the respective nodes.
Click the Attach All button. The original window will go away,
and you'll get a very large multi-panelled window filling the
whole display. The
Stack panel shows stack listings
for all participating processes. You'll see that they all
hang on internal MPI function calls, which do not have
line numbers. But as you go down the stack you eventually
find function calls with reference to appropriate line
numbers within the code, e.g., task 0 should flag:
collect_pixels(), line 68

whereas the other tasks should flag:

main(), line 25

Double-click on the line collect_pixels() in the task 0 stack listing: the code should now appear in the large window on the left with the offending line, in this case

MPI_Recv( pixel_data, 2, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, ...
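Although the full listing of rtrace_bug is not reproduced here, the loop around line 68 presumably looks something like the following sketch; the name to_receive and the function signature are illustrative assumptions, not the program's actual code. The point is that MPI_Recv blocks, so if fewer messages arrive than the loop expects, collect_pixels() never returns:

#include <mpi.h>

/* A sketch of what collect_pixels() may look like; the name to_receive
   and the signature are assumptions, not taken from rtrace_bug. Task 0
   counts received messages down from the total it expects. */
void collect_pixels(int to_receive)     /* e.g., 2500 */
{
   int pixel_data[2];
   MPI_Status status;
   while (to_receive > 0) {
      /* Blocks until a message arrives; if the compute tasks stop
         sending early, this call never completes. */
      MPI_Recv(pixel_data, 2, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
               MPI_COMM_WORLD, &status);
      to_receive--;
   }
}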
Go to the Global Data panel (it may be hidden, in which case you will need to stretch it a little so that it shows its window and push buttons) and right-click on the Task 0 push button. A small menu will pop up; select Show All. Repeat this for all the other tasks.
Look at the local data values. Observe that task 0's count of messages
still to be received is 100, which means that task 0 thinks it is still going
to receive 100 messages.
You can inspect the other tasks the same way, and you'll find that they are all stuck waiting at the barrier.
The mystery is therefore solved: task 0 expected to receive 2500 messages, but received only 2400, so it waits forever in MPI_Recv while the other tasks wait forever at the barrier.
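This is the classic send/receive count mismatch. The sketch below (an illustration, not rtrace_bug itself; all names and the work-splitting arithmetic are assumptions) reproduces the pattern on four tasks: the master expects 2500 messages while the three workers collectively send only 2400, so the master blocks in MPI_Recv and the workers block in MPI_Barrier, exactly the picture pedb showed above. The fix, whatever form the real bug takes, is to make the master's expected count agree with what the workers actually send.

#include <stdio.h>
#include <mpi.h>

/* Illustrative reproduction of the rtrace_bug hang pattern, not the
   original program: the master expects more messages than the workers
   collectively send, so every task deadlocks. Run with 4 tasks. */
int main(int argc, char *argv[])
{
   int rank, size, msg[2] = {0, 0};
   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   MPI_Comm_size(MPI_COMM_WORLD, &size);
   if (rank == 0) {
      int expected = 2500;                 /* master's idea of the total */
      MPI_Status status;
      while (expected > 0) {               /* hangs with 100 left to go  */
         MPI_Recv(msg, 2, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                  MPI_COMM_WORLD, &status);
         expected--;
      }
   } else {
      int i, mine = 2400 / (size - 1);     /* workers' idea: 2400 total  */
      for (i = 0; i < mine; i++)
         MPI_Send(msg, 2, MPI_INT, 0, 0, MPI_COMM_WORLD);
   }
   MPI_Barrier(MPI_COMM_WORLD);            /* workers stall here         */
   MPI_Finalize();
   return 0;
}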