Before you can begin writing to an MPI file in parallel, each process participating in the operation must acquire its own view of that file. A view is defined in terms of three parameters: a displacement, which is a location in the file given as the number of bytes from the beginning of the file; an elementary datatype, the etype; and a filetype, about which more below.
This business about views, filetypes and etypes is a little hard to understand without an example. Assume that we have some etype such as, e.g., a particle structure. This is a record that comprises a number of doubles, some integers, and some characters. We have seen how to build the corresponding MPI derived data type in one of the previous sections. Now let us build a new MPI derived data type which, say, picks up the second and third particle from an array of 6 particles. Symbolically we can write it as follows:

    XXXXXX
    OXXOOO

X in the first row stands for the etype, and the second row represents the new derived data type, with one hole, O, in front, then two particles, XX, and then 3 holes, OOO.
Let us define three derived MPI types as follows:

    type1 = XXOOOO
    type2 = OOXXOO
    type3 = OOOOXX
A view, as I have said above, is a triple (displacement, etype, filetype). Define the following three views:

    (0, X, XXOOOO)
    (0, X, OOXXOO)
    (0, X, OOOOXX)

If these are the views that correspond to three different processes, then when a parallel read takes place, the first process will read the first two particles, the second process will read particles 3 and 4, and the third process will read particles 5 and 6. Then the pointer is advanced to the beginning of the next filetype item and the read operation can commence again. The view of the file that the first process has is:

    XXOOOO XXOOOO XXOOOO XXOOOO ...

The second process sees the following data on the file:

    OOXXOO OOXXOO OOXXOO OOXXOO ...

And the third process' view is:

    OOOOXX OOOOXX OOOOXX OOOOXX ...

In this example all processes' views have the same displacement, but the filetypes are different. The same effect can be accomplished by giving three different displacements and sharing the same filetype:

    (0, X, XXOOOO)
    (sizeof(XX), X, XXOOOO)
    (sizeof(XXXX), X, XXOOOO)

In summary: in order to avoid stepping on each other's toes, each process must have a different view of the shared file. If the views are constructed soundly, then each process is going to work on a different portion of the data.
So how do you construct a view? Use the function

    int MPI_File_set_view(MPI_File fh, MPI_Offset displacement,
                          MPI_Datatype etype, MPI_Datatype filetype,
                          char *datarep, MPI_Info info);

in C and

    MPI_FILE_SET_VIEW(FH, DISP, ETYPE, FILETYPE, DATAREP, INFO, IERROR)
        INTEGER FH, ETYPE, FILETYPE, INFO, IERROR
        CHARACTER*(*) DATAREP
        INTEGER(KIND=MPI_OFFSET_KIND) DISP

in Fortran.
There is one parameter in these interfaces which I haven't talked about yet: the data representation parameter, which is a string.

MPI guarantees full interoperability within a single MPI environment, but there is little support in it, as yet, for external data representation. Yet the moment you begin writing MPI files, this issue gains in importance, because you are quite likely to process those files on a variety of architectures. The following predefined data representation strings are currently available: "native", "internal", and "external32".
Once you've set a view on a file, you can also get it back with the function

    int MPI_File_get_view(MPI_File fh, MPI_Offset *displacement,
                          MPI_Datatype *etype, MPI_Datatype *filetype,
                          char *datarep);

in C, and similarly in Fortran.
So, at this stage all processes should have opened a file and should have defined their view on that file. Now we can begin to write data to the file and to read data from it.
Assuming that you have structured the data on the file with etype and filetype definitions, the simplest way to write data to the file is to call the function

    int MPI_File_write(MPI_File fh, void *buffer, int count,
                       MPI_Datatype datatype, MPI_Status *status);

in C and

    MPI_FILE_WRITE(FH, BUF, COUNT, DATATYPE, STATUS, IERROR)
        <type> BUF(*)
        INTEGER FH, COUNT, DATATYPE, STATUS(MPI_STATUS_SIZE), IERROR

in Fortran. This function transfers count data items of type datatype from the buffer pointed to by buffer to the file pointed to by fh. The data will be written at the position in the file pointed to by the file pointer, and this operation will advance the pointer according to the formula:

    new_file_offset = old_file_offset + (count * elements(datatype)) / elements(etype)

where elements(type) is the number of predefined datatypes in the typemap of type. If datatype is the same as filetype, which is a sensible thing to do, then the pointer will get advanced, in units of etype, by count filetypes, so that, in effect, the writing of the file will proceed as in the example discussed above.
Once you've written some data on the file, you can read it back with

    int MPI_File_read(MPI_File fh, void *buf, int count,
                      MPI_Datatype datatype, MPI_Status *status);

in C and with

    MPI_FILE_READ(FH, BUF, COUNT, DATATYPE, STATUS, IERROR)
        <type> BUF(*)
        INTEGER FH, COUNT, DATATYPE, STATUS(MPI_STATUS_SIZE), IERROR

in Fortran.
These two functions, MPI_File_write and MPI_File_read, are blocking and non-collective, i.e., each process does its reads and writes on its own. Each process can access the file differently, in its own way and at its own time. There is no barrier. Some processes may choose to read their data chunks from the file, and some may forgo reading altogether, depending on what they do.
There is a collective version of these calls, which forces all processes in the communicator to read or write data simultaneously and to wait for each other. These collective functions are called MPI_File_write_all and MPI_File_read_all, and their synopsis (though not their semantics) is the same as for the non-collective versions.
There are also non-blocking versions of these functions. They are:

    int MPI_File_iwrite(MPI_File fh, void *buf, int count,
                        MPI_Datatype datatype, MPI_Request *request);

    MPI_FILE_IWRITE(FH, BUF, COUNT, DATATYPE, REQUEST, IERROR)
        <type> BUF(*)
        INTEGER FH, COUNT, DATATYPE, REQUEST, IERROR

and

    int MPI_File_iread(MPI_File fh, void *buf, int count,
                       MPI_Datatype datatype, MPI_Request *request);

    MPI_FILE_IREAD(FH, BUF, COUNT, DATATYPE, REQUEST, IERROR)
        <type> BUF(*)
        INTEGER FH, COUNT, DATATYPE, REQUEST, IERROR

As you see, the list of parameters is the same, with the exception that
status is replaced with request. You have to keep inspecting the request to check whether the operation has completed. Then you can inspect the status with another MPI function, such as MPI_Test or MPI_Wait.
These non-blocking writes and reads are very useful. Any external I/O operations are excruciatingly slow compared with memory access or with operations that are done on the registers. Consequently if you can organise your program so that you issue a non-blocking I/O request in advance, then go back to your computations and keep checking every now and then if the I/O operation completed, you'll be able to mask the slowness of I/O with computations. Programs like that can be very fast. But they are also extremely difficult to write and to debug.
The functions discussed so far perform sequential writes within their respective views. What if you want to write data at various locations within your view jumping here and there out of order?
For this you would use a family of functions with the extension
_AT. These functions are like the functions already discussed,
but they take one more parameter, namely the offset from the beginning
of the view.
File offsets in MPI/IO are always given in terms of etypes and are always measured from the beginning of the view. File displacements, on the other hand, are given in bytes and are measured from the beginning of the file. This is purely a matter of semantics and naming, and agreeing on it saves unnecessary confusion.
The synopsis for the _AT functions is as follows:

    int MPI_File_write_at(MPI_File fh, MPI_Offset offset, void *buf,
                          int count, MPI_Datatype datatype,
                          MPI_Status *status);

    MPI_FILE_WRITE_AT(FH, OFFSET, BUF, COUNT, DATATYPE, STATUS, IERROR)
        <type> BUF(*)
        INTEGER FH, COUNT, DATATYPE, STATUS(MPI_STATUS_SIZE), IERROR
        INTEGER(KIND=MPI_OFFSET_KIND) OFFSET

    int MPI_File_read_at(MPI_File fh, MPI_Offset offset, void *buf,
                         int count, MPI_Datatype datatype,
                         MPI_Status *status);

    MPI_FILE_READ_AT(FH, OFFSET, BUF, COUNT, DATATYPE, STATUS, IERROR)
        <type> BUF(*)
        INTEGER FH, COUNT, DATATYPE, STATUS(MPI_STATUS_SIZE), IERROR
        INTEGER(KIND=MPI_OFFSET_KIND) OFFSET

and similarly for the nonblocking and collective versions.
There is one more group of MPI reads and writes. For all functions discussed above, every process maintains its own individual file pointer. In the _AT functions the offset is given explicitly; in functions like MPI_File_read and MPI_File_write the pointer is advanced implicitly. But every process would end up reading different data.
What if we want all processes to work through the same file pointer, so that each read or write by any process advances the pointer for all of them? In this case we need to use data access functions with shared file pointers (all processes must then have the same view of the file). The functions are:
    int MPI_File_write_shared(MPI_File fh, void *buf, int count,
                              MPI_Datatype datatype, MPI_Status *status);

    MPI_FILE_WRITE_SHARED(FH, BUF, COUNT, DATATYPE, STATUS, IERROR)
        <type> BUF(*)
        INTEGER FH, COUNT, DATATYPE, STATUS(MPI_STATUS_SIZE), IERROR

    int MPI_File_read_shared(MPI_File fh, void *buf, int count,
                             MPI_Datatype datatype, MPI_Status *status);

    MPI_FILE_READ_SHARED(FH, BUF, COUNT, DATATYPE, STATUS, IERROR)
        <type> BUF(*)
        INTEGER FH, COUNT, DATATYPE, STATUS(MPI_STATUS_SIZE), IERROR

and they also have their collective and non-blocking counterparts.