The purpose of MPI/IO is to provide a high performance, portable, parallel I/O interface to high performance, portable, parallel MPI programs. Parallel I/O is not yet an everyday commodity. Although some supercomputer systems in the past offered parallel disk subsystems, e.g., the Connection Machine CM5 had a Scalable Disk Array (SDA), the Connection Machine CM2 had the DataVault, and the IBM SP had PIOFS and today has GPFS, communication with those peripherals was architecture and operating system dependent.
Yet I/O is such an important part of scientific computation that a system that provides parallel CPUs but only sequential I/O can hardly be called a supercomputer.
MPI I/O, which was contributed to the MPI-2 standard by NASA, builds on MPI derived datatypes and collective communication, so its semantics are very similar to those of the two. MPI/IO files are shared, i.e., multiple processes running on multiple CPUs can operate on a single MPI/IO file at the same time. The file may be spread over the disk systems attached to those CPUs, or it may be spread over some other parallel disk system that the CPUs can access over parallel communication channels.
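The shared-open semantics can be sketched in a few lines of C. This is a minimal illustration, not a complete program of any consequence; the file name testfile is an arbitrary choice, and the program must of course be launched under an MPI runtime:

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_File fh;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Every process in MPI_COMM_WORLD opens the same file at once:
       this single call is collective over the communicator. */
    MPI_File_open(MPI_COMM_WORLD, "testfile",
                  MPI_MODE_CREATE | MPI_MODE_RDWR,
                  MPI_INFO_NULL, &fh);

    /* ... each process can now read and write within its own view ... */

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```

Note that the communicator passed to MPI_File_open determines which processes share the file; a process that opens a file with MPI_COMM_SELF gets it to itself.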
Each MPI file is written as a sequence of etypes. An etype, which stands for elementary datatype, is the unit of data access and positioning. But etypes don't really have to be all that elementary: any derived MPI type can be used as an etype. Since by now you ought to know how complex derived MPI types can be, you should appreciate how rich the structure of MPI files can be, too.
Etypes can then be organized further into a filetype. A filetype describes a data distribution, in terms of etypes and etype-sized holes, within the file. The description given by a filetype can be very complex, much as etypes themselves can get as complex as the context demands and as the programmer can cope with.
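A filetype with holes can be built from ordinary MPI datatype constructors. The sketch below is one way to do it, not the only one: each process owns every nprocs-th block of two integers, and the block length of 2 as well as the file name are arbitrary choices for illustration.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_File fh;
    MPI_Datatype contig, filetype;
    const int blocklen = 2;           /* etypes per block; arbitrary */
    int nprocs, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_File_open(MPI_COMM_WORLD, "testfile",
                  MPI_MODE_CREATE | MPI_MODE_RDWR,
                  MPI_INFO_NULL, &fh);

    /* Data part of the filetype: blocklen consecutive integers. */
    MPI_Type_contiguous(blocklen, MPI_INT, &contig);

    /* Hole part: stretch the extent to nprocs * blocklen integers, so
       that when the filetype is tiled through the file each repetition
       leaves room for the other processes' blocks. */
    MPI_Type_create_resized(contig, 0,
                            (MPI_Aint)(nprocs * blocklen * sizeof(int)),
                            &filetype);
    MPI_Type_commit(&filetype);

    /* Each rank displaces its view by its own block, so the views
       interleave through the file without overlapping. */
    MPI_File_set_view(fh, (MPI_Offset)(rank * blocklen * sizeof(int)),
                      MPI_INT, filetype, "native", MPI_INFO_NULL);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```

Here the etype is plain MPI_INT and the holes are created by resizing the extent of a contiguous type; more elaborate distributions would use MPI_Type_vector, MPI_Type_create_subarray, and friends in the same role.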
MPI processes that open a shared MPI file acquire their own views of that file. A view is what a given process can see inside the file: all I/O operations performed by that process occur within its view. Normally one would design the whole I/O in such a way that the views of separate processes do not overlap, although overlapping views are allowed.
In this section we will first look at how to manipulate MPI files. Then we'll discuss reading data from and writing data to MPI files, followed by filetype constructors, which form a category similar to MPI datatype constructors, and, finally, we'll have a look at some examples.
On our SP system you can call MPI/IO functions against two file systems. The first one is GPFS. If you use GPFS then all processes that your job runs on must be GPFS clients and must mount the same GPFS file system, and, of course, this is the file system that you will write to. The other file system you can call MPI/IO functions against is HPSS. You can perform MPI/IO transactions with HPSS from any SP node that has DCE and Encina libraries installed and configured.
At present GPFS is available on all nodes that run parallel production jobs, i.e., nodes that support the production classes, but not on the test nodes, which, of course, makes testing MPI/IO jobs somewhat difficult.
GPFS MPI/IO programs don't require linking with any special libraries other than what you normally get if you call the mpcc or the mpxlf90 wrappers. HPSS MPI/IO programs need to be linked with

   libmpioapi.a libhpss.a libmpi.a libEncina.a libEncClient.a libdce.a libdcepthreads.a libpthreads_compat.a libpthreads.a

in this order. You will also have to set the following variables in your environment:
MPIO_LOGIN_NAME, which should be set to your HPSS user name;
MPIO_KEYTAB_PATH, which should point to where you store the keytab file;
HPSS_LS_SERVER, which should point to the HPSS Location Server;
MPIO_DEBUG, which should be set to whatever level of MPI/IO debugging messages you want to receive.
Finally, you will have to #include <mpio.h> in your program and point the compiler to the HPSS version of the MPI/IO include files.
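Putting the HPSS requirements together, a build might look like the sketch below. The program name and the include path are placeholders, not real locations on the system; the library order follows the list given above:

```shell
# Hypothetical example: my_mpio_prog and the -I path are placeholders.
# The libraries must appear in exactly the order listed above.
mpcc -I/path/to/hpss/mpio/include -o my_mpio_prog my_mpio_prog.c \
     -lmpioapi -lhpss -lmpi -lEncina -lEncClient -ldce \
     -ldcepthreads -lpthreads_compat -lpthreads
```
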