Official website and documentation: http://h5py.org
Mailing list (both dev and discussion): Google Groups
See Building h5py for installation instructions.
HDF5 is an open-source library and file format for storing large amounts of numerical data, originally developed at NCSA. It is widely used in the scientific community for everything from NASA’s Earth Observing System to the storage of data from laboratory experiments and simulations.
Over the past few years, HDF5 has rapidly emerged as the de-facto standard technology in Python for storing large numerical datasets. The h5py package is a Pythonic, easy-to-use yet full-featured interface to HDF5.
The package is designed with two major goals in mind:
The files you create can be read by anyone else using HDF5-enabled software, whether they’re using Python, IDL, MATLAB or another software package.
HDF5 files hold datasets, which are array-like collections of data, and groups, which are folder-like containers that hold datasets and other groups.
The most fundamental thing to remember when using h5py is:
Groups work like dictionaries, and datasets work like NumPy arrays
The very first thing you’ll need to do is create a new file:
>>> import h5py
>>> import numpy as np
>>>
>>> f = h5py.File("mytestfile.hdf5", "w")
The File object is your starting point. It has a couple of methods which look interesting. One of them is create_dataset:
>>> dset = f.create_dataset("mydataset", (100,), dtype='i')
The object we created isn’t an array, but an HDF5 dataset. Like NumPy arrays, datasets have both a shape and a data type:
>>> dset.shape
(100,)
>>> dset.dtype
dtype('int32')
They also support array-style slicing. This is how you read and write data from a dataset in the file:
>>> dset[...] = np.arange(100)
>>> dset[0]
0
>>> dset[9]
9
>>> dset[0:100:10]
array([ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90])
“HDF” stands for “Hierarchical Data Format”. Every object in an HDF5 file has a name, and they’re arranged in a POSIX-style hierarchy with /-separators:
>>> dset.name
u'/mydataset'
The “folders” in this system are called groups. The File object we created is itself a group, in this case the root group, named /:
>>> f.name
u'/'
Creating a subgroup is accomplished via the aptly-named create_group:
>>> grp = f.create_group("subgroup")
All Group objects also have the create_* methods like File:
>>> dset2 = grp.create_dataset("another_dataset", (50,), dtype='f')
>>> dset2.name
u'/subgroup/another_dataset'
By the way, you don’t have to create all the intermediate groups manually. Specifying a full path works just fine:
>>> dset3 = f.create_dataset('subgroup2/dataset_three', (10,), dtype='i')
>>> dset3.name
u'/subgroup2/dataset_three'
Groups support most of the Python dictionary-style interface. You retrieve objects in the file using the item-retrieval syntax:
>>> dataset_three = f['subgroup2/dataset_three']
Iterating over a group provides the names of its members:
>>> for name in f:
...     print name
mydataset
subgroup
subgroup2
Containership testing also uses names:
>>> "mydataset" in f
True
>>> "somethingelse" in f
False
You can even use full path names:
>>> "subgroup/another_dataset" in f
True
There are also the familiar keys(), values(), items() and iter*() methods, as well as get().
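A minimal sketch of the dictionary-style methods, rebuilding the example file from this guide so it runs standalone:

```python
import h5py

# Rebuild the example file from this guide so the snippet is standalone.
with h5py.File("mytestfile.hdf5", "w") as f:
    f.create_dataset("mydataset", (100,), dtype='i')
    f.create_group("subgroup")
    f.create_dataset("subgroup2/dataset_three", (10,), dtype='i')

    # keys() lists directly-attached member names (alphabetical by default)
    print(list(f.keys()))         # ['mydataset', 'subgroup', 'subgroup2']
    # get() returns None (or a default) instead of raising KeyError
    print(f.get("no_such_name"))  # None
    print(f["mydataset"].shape)   # (100,)
```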
Since iterating over a group only yields its directly-attached members, iterating over an entire file is accomplished with the Group methods visit() and visititems(), which take a callable:
>>> def printname(name):
...     print name
>>> f.visit(printname)
mydataset
subgroup
subgroup/another_dataset
subgroup2
subgroup2/dataset_three
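Where visit() passes only the name, visititems() also passes the object itself, so the callable can tell groups and datasets apart. A sketch, using a small file built on the spot:

```python
import h5py

# Build a small two-level file so the snippet runs standalone.
with h5py.File("mytestfile.hdf5", "w") as f:
    f.create_dataset("mydataset", (100,), dtype='i')
    f.create_dataset("subgroup/another_dataset", (50,), dtype='f')

    # visititems() calls the callable with (name, object) for every
    # member, recursively; collect the results to inspect them.
    found = []
    def collect(name, obj):
        found.append((name, type(obj).__name__))
    f.visititems(collect)
    print(found)
```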
One of the best features of HDF5 is that you can store metadata right next to the data it describes. All groups and datasets support attached named bits of data called attributes.
Attributes are accessed through the attrs proxy object, which again implements the dictionary interface:
>>> dset.attrs['temperature'] = 99.5
>>> dset.attrs['temperature']
99.5
>>> 'temperature' in dset.attrs
True
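Attribute values are not limited to scalars: strings and small NumPy arrays work too, and attrs supports dictionary-style iteration. A sketch, using a hypothetical standalone file:

```python
import h5py
import numpy as np

# A hypothetical file name, just to demonstrate the attrs interface.
with h5py.File("attrs_demo.hdf5", "w") as f:
    dset = f.create_dataset("mydataset", (100,), dtype='i')
    # Attribute values can be scalars, strings, or small NumPy arrays
    dset.attrs['temperature'] = 99.5
    dset.attrs['run_id'] = "exp-042"
    dset.attrs['calibration'] = np.array([1.0, 0.5, 0.25])
    # attrs supports the usual dictionary-style iteration
    for name, value in dset.attrs.items():
        print(name, value)
```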
The h5py package supports every NumPy type which maps to a native HDF5 type, and a few others.
h5py also supports some additional types inherited from HDF5.
For example, variable-length strings let you store Python-style (as opposed to fixed-width “S”) strings using native HDF5 constructs. No Python-specific code or pickling is used.
Create a dtype object to represent these by using special_dtype:
>>> dt = h5py.special_dtype(vlen=str) # bytes/str/unicode all supported
Then create your dataset using that type:
>>> dset = f.create_dataset("stringy", (2,), dtype=dt)
>>> dset[0] = "Hello"
>>> dset[1] = "Hello this is a longer string"
>>> dset[...]
array(['Hello', 'Hello this is a longer string'], dtype=object)