In this section we are going to look at how to handle errors in MPI programs.
There are two categories of errors we are going to encounter: UNIX errors and MPI errors.
MPI errors can arise when messages are incorrectly constructed, addressed, sent or received, or when they get lost because of network problems or because some of the nodes that participate in the communication have crashed. The latter kind, i.e., failures of the communication system, falls into the category of largely intractable problems. Even if we could diagnose such problems from within a program, there is little we could do about them. Consequently, MPI does not really provide mechanisms for dealing with failures in the communication system. Instead, the MPI-2 standard assumes that MPI implementors are going to ``insulate the user from this unreliability'' by developing systems that are as robust and fault tolerant as present-day technology allows.
The former, i.e., MPI program errors, ought to be tractable, and it is here that MPI provides mechanisms for handling recoverable errors. Using them, however, requires explicitly changing the default behaviour of an MPI program, which is to abort the whole parallel computation as soon as an MPI program error is detected.
We are going to talk more about these mechanisms in the second part of this section.
In the first part we are going to look at how we can use MPI itself (with the assumption that MPI, at least, does not fail) in order to process UNIX errors that may arise during program execution.
The difficulty here is that a process that captures a UNIX error should not just exit. Well, it may, but this would take down the whole parallel program and we may be none the wiser as to why it happened. The trick is to capture any UNIX problems that may arise and then to build the logic of the parallel program around them.