So far we have used the default property list, H5P_DEFAULT, wherever
a property list was required.
In HDF5, property lists are much more than decoration or a means of
tweaking I/O here and there. Numerous features, including MPI-IO based
parallel I/O, are activated by means of property lists. The device of a
property list also saves HDF5 developers from having to overload HDF5
functions with an excessive number of parameters, many of which would
normally not be used at all.
HDF5 property lists are opaque objects, which are manipulated by invoking
special functions only, and this is just as well, because it reduces the
chance of a programmer making a mistake.
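The idea can be sketched in a few lines of plain C. This is only a toy
illustration of the pattern, not HDF5 code: every name that begins with
toy_ is hypothetical, and the three properties were chosen just for the
example. The point is that the operation takes a single handle, and that
the values behind the handle are changed only through functions that can
check them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy illustration only: none of these names are part of HDF5. In a real
 * library the struct definition would live in a private source file, so
 * callers could only hold a pointer (an opaque handle) to it. */
typedef struct toy_plist {
    size_t userblock;   /* bytes reserved at the start of the file */
    size_t chunk_bytes; /* 0 means "store contiguously" */
    int    compression; /* 0 (off) to 9 (strongest) */
} toy_plist;

/* Create a list that carries a default value for every property. */
toy_plist *toy_plist_create(void)
{
    toy_plist *p = malloc(sizeof *p);
    if (p != NULL) {
        p->userblock = 0;
        p->chunk_bytes = 0;
        p->compression = 0;
    }
    return p;
}

/* A "set" function validates its argument, so a bad value is rejected
 * here rather than used later. Returns 0 on success, -1 on error. */
int toy_plist_set_compression(toy_plist *p, int level)
{
    if (p == NULL || level < 0 || level > 9)
        return -1;
    p->compression = level;
    return 0;
}

/* A "get" function reports the value currently stored in the list. */
int toy_plist_get_compression(const toy_plist *p)
{
    assert(p != NULL);
    return p->compression;
}

/* Release the list once it is no longer needed. */
void toy_plist_close(toy_plist *p)
{
    free(p);
}

/* The operation takes one handle instead of one argument per property.
 * Passing NULL plays the role of "use the defaults". In this sketch the
 * function only reports which compression level it would apply. */
int toy_create_dataset(const char *name, const toy_plist *p)
{
    (void)name; /* unused in this sketch */
    return (p != NULL) ? p->compression : 0;
}
```

A caller who is happy with the defaults passes NULL and never sees the
properties at all, which is the role H5P_DEFAULT has played in our
programs so far. Adding a fourth property later would not change the
signature of toy_create_dataset.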
Property lists are used to customize operations such as
- file creation
- accessing a file
- dataset creation
- dataset read/write
For every one of these operations HDF5 provides an operation-specific
default property list, which can then be customized with various
functions from the H5P family (where "P" stands for Property).
And so we have:
- for file creation, the H5P_FILE_CREATE default property list;
- for accessing a file, the H5P_FILE_ACCESS default property list;
- for dataset creation, the H5P_DATASET_CREATE default property list;
- for dataset read/write, the H5P_DATASET_XFER default property list.
What sort of customizations can we ask for by modifying the
default property lists? You can see them all quickly, if
you connect to
http://hdf.ncsa.uiuc.edu/HDF5/doc/RM_H5P.html. You will
find there:
- 9 functions for customizing file creation properties;
- 35 functions for customizing file access properties
including 4 for interaction with MPI;
- 24 functions for customizing dataset creation;
- 23 functions for customizing dataset read/write including
2 for interaction with MPI.
Now, roughly half of these functions are get functions, which
just get you the existing property of some type, and half
are set functions, which set a specified property
in the list, so this whole property list business is not as intimidating
as it seems to be at first glance. But there is still a lot of hidden
functionality here.
Some specific properties that can be activated or de-activated
are as follows:
- file creation properties:
  - user block: HDF5 may leave a block of a certain size at the
    beginning of the file, for the user to fill with whatever
    non-HDF5 data the user wants.
  - byte size of offsets and lengths: the byte sizes of offsets and
    lengths in HDF5 files can be set to 2, 4, 8 or 16. Normally you
    will probably want them 8 bytes long, but they may default to
    4 bytes on IA32 systems (e.g., AVIDD), so you may need to check
    this and correct it.
  - sizing the symbol table: HDF5 files contain directories (or
    groups), organized hierarchically. This is done in a way that is
    similar to a directory structure, i.e., there is a symbol table,
    which is used to look up a specific group or dataset. You can set
    the parameters that control the size of the symbol table nodes.
  - sizing B-trees for chunked datasets: So far we have seen
    contiguous and rigorously pre-sized datasets. But HDF5 datasets
    can be extended dynamically. This is done, again, in a way that
    is similar to how files are written on disks. Files are not
    normally written contiguously. The writing process jumps all over
    the disk, writing the file wherever it finds space. The locations
    of the chunks of data are then stored in a B-tree, which has to be
    traversed in order to read the whole file. You can store data
    similarly within an HDF5 file, and a chunked dataset will then be
    described by a B-tree, whose parameters can be controlled by
    means of property lists.
- file access properties:
  - memory caching: HDF5 can cache whole files in memory on specific
    request. All I/O operations are then done against memory, and the
    file itself may never even be written to disk unless, again, this
    is specifically requested. Alternatively, HDF5 can be asked to use
    specifically sized caches for the metadata and for the raw data of
    files that are not going to be cached in memory in their entirety.
  - file families: If you work with a file system that imposes a limit
    on the size of a file, e.g., 2 GB (common to 32-bit UFS and NFS),
    and your dataset exceeds this, you have the option of writing your
    single logical HDF5 file physically in the form of a family of
    files, all below the file system size limit.
  - logging: You can activate logging of all I/O requests against an
    HDF5 file.
  - splitting: You can split a single logical HDF5 file so that its
    metadata and data live in separate physical files. This is similar
    to old MacOS file forks.
  - MPI access: MPI files are associated with MPI communicators, and
    they may also have MPI-IO info structures that contain hints
    associated with them. If an MPI file is to be written in HDF5
    format, then the communicator and the info structure must be
    passed to HDF5 file access functions in the form of a property
    list.
  - Globus hooks: In order to operate on files in the Globus
    environment, the user must provide Globus hints in a special
    Globus info structure (much like the MPI info structure). These
    are then passed to HDF5 by means of a property list.
  - SRB hooks: HDF5 can also co-exist with the SDSC's Storage Request
    Broker (SRB). SRB also requires an info structure to operate on
    SRB files, and this can be passed through a property list.
  - streams: HDF5 files can be made to stream directly into I/O
    sockets.
- dataset creation properties:
  - dataset layout: HDF5 datasets can be laid out in three ways.
    First, if the dataset itself is very small, it can be stored in
    its entirety in the object header. This is similar to storing
    very small files within an i-node in some file systems. Second,
    the dataset may be stored contiguously. Third, we may request that
    it be chunked instead, in which case the data may be physically
    scattered throughout the whole file, in chunks. The size of the
    chunks can be customized too.
  - data compression: We may request that data stored in an HDF5 file
    be compressed. Both the compression method and the degree of
    compression may be selected.
  - data filtering: Compression is a form of data filtering, but we
    may also request that a user-defined filter be applied to data
    streams on writing and reading datasets. The filter may be used,
    e.g., to encrypt the data, or to carry out on-the-fly selection
    of data.
- dataset read/write properties:
  - error detection: We may enable error detection on reads and
    writes. Normally devices such as individual disk drives and disk
    arrays do hardware-level error detection anyway. Here you can add
    your own additional error detection method, e.g., a checksum.
  - callbacks: If you have defined your own data filtering, here you
    may additionally define a callback, which is going to be activated
    when the filter fails.
  - MPI-IO: You can specify here whether a write or a read operation
    against an HDF5/MPI-IO file is to be collective or independent.
  - blocking: If you do a lot of small reads and writes, you can
    request that these operations be blocked, i.e., HDF5 will collect
    all I/O until the block is full, and only then will the block be
    transferred to the media. You can request a specific size of the
    block.
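The chunked dataset layout mentioned in the list above comes down to
simple index arithmetic, which the following sketch makes concrete for
a two-dimensional dataset. The helper names are hypothetical and this
is not how HDF5 is implemented internally: HDF5 looks up the file
address of a chunk in a B-tree rather than computing it. In HDF5
itself the chunk shape is requested with H5Pset_chunk on a dataset
creation property list. The row-major numbering of chunks used here is
merely a convenient way of naming them.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative helpers, not HDF5 calls. A 2-D dataset is cut into
 * chunks of chunk_rows x chunk_cols elements. */

/* Number of chunks needed along one dimension. The division rounds up,
 * because a partial chunk at the edge still occupies a whole chunk. */
size_t chunks_along(size_t extent, size_t chunk_extent)
{
    assert(chunk_extent > 0);
    return (extent + chunk_extent - 1) / chunk_extent;
}

/* Row-major number of the chunk that holds element (i, j) of a dataset
 * with `cols` columns. */
size_t chunk_of(size_t i, size_t j, size_t cols,
                size_t chunk_rows, size_t chunk_cols)
{
    size_t chunks_per_row = chunks_along(cols, chunk_cols);
    assert(chunk_rows > 0);
    return (i / chunk_rows) * chunks_per_row + (j / chunk_cols);
}
```

For a dataset of 100 x 250 elements cut into chunks of 10 x 100, there
are 10 chunks down and 3 across, and element (25, 130) falls into chunk
2 * 3 + 1 = 7. Reading a single row of this dataset touches 3 chunks,
which is why the chunk shape should follow the expected access pattern.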
These are not all properties and features available. They are
just the ones that caught my eye.
Even though HDF5 provides a lot of functionality here, you should not
run amok with it. Remember that I/O is always going to be orders of
magnitude slower than computation and memory access. Consequently, the
best way to do I/O is to do as little of it as possible. Get all the
data you need into memory, if you can, and then operate on it there.
After you finish, write it out and update the file. Read and write in
large blocks rather than in tiny amounts. Do not use files for temporary
scratch space, and certainly not to communicate between processes. The
so-called "out-of-core" jobs are unbelievably wasteful. If you lack
sufficient memory on, say, 8 nodes, go for 32 nodes, but always try to
fit all the data you need for the computation in memory.
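The advice to read and write in large blocks can be illustrated with a
small write buffer in plain C. This is a sketch under stated
assumptions, not HDF5 code: the block_writer type, the bw_ functions
and the 4096-byte block size are all hypothetical, and the "transfer"
is only counted, not performed. It shows how many trips to the media a
stream of small writes costs once the writes are collected into blocks.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE 4096 /* hypothetical block size in bytes */

/* Illustrative write buffer, not an HDF5 structure. */
typedef struct block_writer {
    unsigned char block[BLOCK_SIZE];
    size_t used;      /* bytes currently waiting in the block */
    size_t transfers; /* how many times a block went to the media */
} block_writer;

void bw_init(block_writer *w)
{
    assert(w != NULL);
    w->used = 0;
    w->transfers = 0;
}

/* Stand-in for the real (slow) transfer: it only counts the call and
 * empties the block. An empty block is not transferred. */
void bw_flush(block_writer *w)
{
    if (w->used > 0) {
        w->transfers++;
        w->used = 0;
    }
}

/* Copy a small write into the block, flushing whenever the block fills. */
void bw_write(block_writer *w, const void *data, size_t len)
{
    const unsigned char *src = data;
    while (len > 0) {
        size_t room = BLOCK_SIZE - w->used;
        size_t n = (len < room) ? len : room;
        memcpy(w->block + w->used, src, n);
        w->used += n;
        src += n;
        len -= n;
        if (w->used == BLOCK_SIZE)
            bw_flush(w);
    }
}
```

With these definitions, 1024 writes of 8 bytes each fill exactly two
blocks, so they cost 2 transfers instead of 1024. A final call to
bw_flush is needed to push out whatever remains in a partly filled
block.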
Zdzislaw Meglicki
2004-04-29