Tags: python, hdf5, h5py

"Cannot create cython.array from NULL pointer" error when indexing an HDF5 dataset in Python


I am using the h5py package to extract data from an HDF5 file and manipulate it using Python. There is a dataset called "Bodies" in the file h5Test.pph, so I first set up with:

import h5py
f = h5py.File('h5Test.pph', 'r')
bodies = f['Bodies']

From there I am able to access most indices in bodies (e.g. 0, 4, 1000), but for some reason bodies[2] and bodies[3] both raise this error:

ValueError: Cannot create cython.array from NULL pointer

I have used the h5dump command line tool to confirm that these entries exist, and nothing looks strange about the data. I am new to both HDF5 files and posting on Stack Overflow, so please let me know if there is any additional information that would be useful.

Edit for additional information:

numpy.shape(bodies) returns

(10689,)

and numpy.dtype(bodies) returns

dtype({'names':['ID','Name','Orientation','Color','Position','Velocity','Angular velocity','Change in w in body frame','Force','Torque','Additional force','Temperature','Angular momentum','Principal moments of inertia','Mass','Volume','Scale','Shape','Group','Material','Mode','Lua control functions','Monitored','Stress'], 'formats':[[('ID', '<i8')],[('data', 'O')],[('x', '<f4'), ('y', '<f4'), ('z', '<f4'), ('w', '<f4')],[('red', '<f4'), ('green', '<f4'), ('blue', '<f4'), ('alpha', '<f4')],[('x', '<f4'), ('y', '<f4'), ('z', '<f4')],[('x', '<f4'), ('y', '<f4'), ('z', '<f4')],[('x', '<f4'), ('y', '<f4'), ('z', '<f4')],[('x', '<f4'), ('y', '<f4'), ('z', '<f4')],[('x', '<f4'), ('y', '<f4'), ('z', '<f4')],[('x', '<f4'), ('y', '<f4'), ('z', '<f4')],[('x', '<f4'), ('y', '<f4'), ('z', '<f4')],'<f4',[('x', '<f4'), ('y', '<f4'), ('z', '<f4')],[('x', '<f4'), ('y', '<f4'), ('z', '<f4')],'<f4','<f4','<f4',[('ID', '<i8')],[('ID', '<i8')],[('ID', '<i8')],'<u4','O','u1',{'names':['[0, 0]','[0, 1]','[0, 2]','[1, 0]','[1, 1]','[1, 2]','[2, 0]','[2, 1]','[2, 2]'], 'formats':['<f4','<f4','<f4','<f4','<f4','<f4','<f4','<f4','<f4'], 'offsets':[0,12,24,4,16,28,8,20,32], 'itemsize':36}], 'offsets':[0,8,16,32,48,60,72,84,96,108,120,132,136,148,160,164,168,172,180,188,196,200,216,217], 'itemsize':253})

Also for example, bodies[0] returns

((1487,), (b'compactor_disk',), (0., 0., 0., 1.), (0.38671875, 0.38671875, 0.38671875, 0.5859375), (0., 0., 0.), (0., 0., 0.), (0., 0., 0.), (0., 0., 0.), (-0.72721094, 0.20889588, -41.384094), (0.01420393, 0.34127262, 0.05411187), (0., 0., 0.), 0., (0., 0., 0.), (2.5448524e-06, 2.5513377e-06, 4.367137e-06), 0.00292779, 3.9037224e-07, 10., (14954,), (1736,), (1738,), 16, array([(22769,)], dtype=[('ID', '<i8')]), 1, (1035920.44, -53857.83, 874206.7, 84758.26, 1571146.4, -36402.49, -16.688602, -17.05545, 553.0667))

Using the command h5dump -d Bodies h5Test.pph yields a rather long output, but one of the elements that gave the error is this one:

   (2): {
         {
            1489
         },
         {
            "lid"
         },
         {
            0,
            0,
            0,
            1
         },
         {
            0.386719,
            0.386719,
            0.386719,
            0.585938
         },
         {
            0,
            0,
            0.3
         },
         {
            0,
            0,
            0
         },
         {
            0,
            0,
            0
         },
         {
            0,
            0,
            0
         },
         {
            0,
            0,
            0
         },
         {
            0,
            0,
            0
         },
         {
            0,
            0,
            0
         },
         0,
         {
            0,
            0,
            0
         },
         {
            2.59831e-06,
            2.60443e-06,
            4.41639e-06
         },
         0.00282375,
         3.765e-07,
         10,
         {
            15293
         },
         {
            1736
         },
         {
            1738
         },
         16,
         (),
         0,
         {
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0
         }
      },

Solution

  • Let's start with some HDF5 and h5py basics. When you enter bodies = f['Bodies'], the returned object (bodies) is an h5py Dataset that behaves much like a NumPy array. Your shape and dtype calls show how it is laid out.

    This dataset is similar to a recarray with 10689 rows of heterogeneous data. The dtype is described by a dictionary whose 'names' and 'formats' lists work as pairs (the 'offsets' and 'itemsize' entries give the byte layout). For example, field 1 is named 'ID' and holds a single 64-bit integer; field 2 is named 'Name' and holds a Python object; field 3 is named 'Orientation' and holds 4 floats whose members are 'x', 'y', 'z', 'w' respectively. This continues down the names/formats pairs. Some of the fields are more complicated: the last one, 'Stress', is described by a nested dictionary, and 'Lua control functions' is another Python object. (h5py uses the NumPy object dtype 'O' for HDF5 data that has no fixed-size NumPy equivalent, such as variable-length strings and variable-length arrays. In your bodies[0] output, the 'Lua control functions' entry is a small array of IDs.)

    So, when you enter bodies[i], you are reading the i-th row of data from the dataset. This is how your output from bodies[0] maps to the dataset:

    bodies[0]['ID'] = (1487,)
    bodies[0]['Name'] = (b'compactor_disk',)
    bodies[0]['Orientation'] = (0., 0., 0., 1.)
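
Because most of these fields are themselves compound, reaching a plain value takes a second lookup by member name. The sketch below is only an illustration built with NumPy, not a read of the real file: `body_dtype` is a hypothetical, cut-down copy of the first three fields of the dtype shown in the question, and the two rows are typed in by hand from the bodies[0] and h5dump output.

```python
import numpy as np

# Hypothetical, cut-down version of the 'Bodies' dtype (3 of the 24 fields).
body_dtype = np.dtype([
    ('ID', [('ID', '<i8')]),
    ('Name', [('data', 'O')]),
    ('Orientation', [('x', '<f4'), ('y', '<f4'), ('z', '<f4'), ('w', '<f4')]),
])

# Two hand-written rows standing in for bodies[0] and bodies[2].
rows = np.array(
    [
        ((1487,), (b'compactor_disk',), (0., 0., 0., 1.)),
        ((1489,), (b'lid',), (0., 0., 0., 1.)),
    ],
    dtype=body_dtype,
)

first = rows[0]
body_id = first['ID']['ID']        # outer field, then its single member
name = first['Name']['data']       # the object member holds the bytes string
w = first['Orientation']['w']      # one component of the orientation
all_ids = rows['ID']['ID']         # the same lookup works on a whole column

print(body_id, name, w, all_ids)
```

The same two-step indexing applies to a row that h5py has returned, since that row is a NumPy structured scalar.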
    

    And, based on the output from h5dump this is how the output for bodies[2] should map to the dataset:

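    bodies[2]['ID'] = (1489,)
    bodies[2]['Name'] = (b'lid',)
    bodies[2]['Orientation'] = (0, 0, 0, 1)
    Note that the values look like ints rather than floats only because h5dump prints whole-number floats without a decimal point (the 'Scale' field shows up as 10 in h5dump and as 10. in Python), so this is not a problem.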
    

    As @hpaulj notes, the 'Lua control functions' output differs between the two rows: bodies[0] holds a one-element array, while h5dump shows () for element 2, i.e. a zero-length array. That empty entry could be what triggers the error.

    You can access the data in each field/column individually (by name). Loop over the field names to isolate which field causes the problem:

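    import h5py

    with h5py.File('h5Test.pph', 'r') as h5f:
        bodies = h5f['Bodies']
        for field in bodies.dtype.names:
            print('reading field:', field)
            try:
                temp = bodies[field]  # reads this field for every row
            except ValueError as err:
                # keep going so every failing field gets reported
                print('  failed:', err)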
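
Once the loop points at a field, the next step could be to find which rows of that field fail. The helper below is a minimal sketch of that idea, not something taken from h5py: `find_failing_rows` and `FakeDataset` are hypothetical names, and the helper assumes the dataset accepts a row index and a field name together (`dataset[i, field]`), which h5py documents for compound datasets. `FakeDataset` only imitates the behaviour described in the question so the example runs without the file.

```python
def find_failing_rows(dataset, field, exc_types=(ValueError,)):
    """Return the row indices where reading `field` raises one of exc_types."""
    failing = []
    for i in range(len(dataset)):
        try:
            dataset[i, field]  # read a single field of a single row
        except exc_types:
            failing.append(i)
    return failing


class FakeDataset:
    """Hypothetical stand-in for the h5py dataset; fails on chosen rows."""

    def __init__(self, n_rows, bad_rows, bad_field):
        self.n_rows = n_rows
        self.bad_rows = set(bad_rows)
        self.bad_field = bad_field

    def __len__(self):
        return self.n_rows

    def __getitem__(self, key):
        row, field = key
        if field == self.bad_field and row in self.bad_rows:
            raise ValueError('Cannot create cython.array from NULL pointer')
        return 0  # placeholder value; the real dataset returns the field data


fake = FakeDataset(6, bad_rows=[2, 3], bad_field='Lua control functions')
bad = find_failing_rows(fake, 'Lua control functions')
good = find_failing_rows(fake, 'ID')
print(bad, good)
```

With the real file you would call `find_failing_rows(bodies, 'Lua control functions')` inside the `with` block, swapping in whichever field name the loop flagged. Reading 10689 rows one at a time is slow, but it only needs to run once.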