
How to define an individual data type for each HDF5 column with h5py


I have looked at several solutions, but could not work out how to apply them to multidimensional arrays. To be precise, my code produces a larger array than it should:

import h5py
import pandas as pd
import numpy as np

data = [[1583663558450195, -7.063664436340332, -6.2776079177856445, -4.206898212432861, -4.206898212432861], [1583663558450195, -7.063664436340332, -6.2776079177856445, -4.206898212432861, -4.206898212432861], [1583663558450195, -7.063664436340332, -6.2776079177856445, -4.206898212432861, -4.206898212432861], [1583663558450195, -7.063664436340332, -6.2776079177856445, -4.206898212432861, -4.206898212432861], [1583663558450195, -7.063664436340332, -6.2776079177856445, -4.206898212432861, -4.206898212432861], [1583663558450195, -7.063664436340332, -6.2776079177856445, -4.206898212432861, -4.206898212432861], [1583663558450195, -7.063664436340332, -6.2776079177856445, -4.206898212432861, -4.206898212432861], [1583663558450195, -7.063664436340332, -6.2776079177856445, -4.206898212432861, -4.206898212432861], [1583663558450195, -7.063664436340332, -6.2776079177856445, -4.206898212432861, -4.206898212432861], [1583663558450195, -7.063664436340332, -6.2776079177856445, -4.206898212432861, -4.206898212432861]]

df = pd.DataFrame(data)

hf = h5py.File('dtype.h5', 'w')

dataTypes = np.dtype([('ts', 'u8'), ('x', 'f4'), ('y', 'f4'), ('z', 'f4'), ('temp', 'f4')])
ds = hf.create_dataset('Acceleration', data=df.astype(dataTypes))


I would like the dataset to look like this, with the first column stored as uint64 and the other four as float32:

                 ts         x         y         z      temp
0  1583663558450195 -7.063664 -6.277608 -4.206898 -4.206898
1  1583663558450195 -7.063664 -6.277608 -4.206898 -4.206898
2  1583663558450195 -7.063664 -6.277608 -4.206898 -4.206898
3  1583663558450195 -7.063664 -6.277608 -4.206898 -4.206898
4  1583663558450195 -7.063664 -6.277608 -4.206898 -4.206898
5  1583663558450195 -7.063664 -6.277608 -4.206898 -4.206898
6  1583663558450195 -7.063664 -6.277608 -4.206898 -4.206898
7  1583663558450195 -7.063664 -6.277608 -4.206898 -4.206898
8  1583663558450195 -7.063664 -6.277608 -4.206898 -4.206898
9  1583663558450195 -7.063664 -6.277608 -4.206898 -4.206898

Solution

  • Your df:

    In [370]: df                                                                                   
    Out[370]: 
                      0         1         2         3         4
    0  1583663558450195 -7.063664 -6.277608 -4.206898 -4.206898
    1  1583663558450195 -7.063664 -6.277608 -4.206898 -4.206898
    2  1583663558450195 -7.063664 -6.277608 -4.206898 -4.206898
    3  1583663558450195 -7.063664 -6.277608 -4.206898 -4.206898
    ...
    

    df.astype(dataTypes) gives me a TypeError (my pandas isn't the latest version).

    In [373]: df.to_records()                                                                      
    Out[373]: 
    rec.array([(0, 1583663558450195, -7.06366444, -6.27760792, -4.20689821, -4.20689821),
               (1, 1583663558450195, -7.06366444, -6.27760792, -4.20689821, -4.20689821),
               (2, 1583663558450195, -7.06366444, -6.27760792, -4.20689821, -4.20689821),
               (3, 1583663558450195, -7.06366444, -6.27760792, -4.20689821, -4.20689821),
               (4, 1583663558450195, -7.06366444, -6.27760792, -4.20689821, -4.20689821),
               (5, 1583663558450195, -7.06366444, -6.27760792, -4.20689821, -4.20689821),
               (6, 1583663558450195, -7.06366444, -6.27760792, -4.20689821, -4.20689821),
               (7, 1583663558450195, -7.06366444, -6.27760792, -4.20689821, -4.20689821),
               (8, 1583663558450195, -7.06366444, -6.27760792, -4.20689821, -4.20689821),
               (9, 1583663558450195, -7.06366444, -6.27760792, -4.20689821, -4.20689821)],
              dtype=[('index', '<i8'), ('0', '<i8'), ('1', '<f8'), ('2', '<f8'), ('3', '<f8'), ('4', '<f8')])
    

    This array should save with h5py.

    to_records has parameters that may create something closer to your dataTypes. I'll let you explore those.

    But with the latest restructuring of recfunctions, we can make a structured array with:

    In [385]: import numpy.lib.recfunctions as rf                                                  
    In [386]: rf.unstructured_to_structured(np.array(data), dataTypes)                             
    Out[386]: 
    array([(1583663558450195, -7.0636644, -6.277608, -4.206898, -4.206898),
           (1583663558450195, -7.0636644, -6.277608, -4.206898, -4.206898),
           (1583663558450195, -7.0636644, -6.277608, -4.206898, -4.206898),
           (1583663558450195, -7.0636644, -6.277608, -4.206898, -4.206898),
           (1583663558450195, -7.0636644, -6.277608, -4.206898, -4.206898),
           (1583663558450195, -7.0636644, -6.277608, -4.206898, -4.206898),
           (1583663558450195, -7.0636644, -6.277608, -4.206898, -4.206898),
           (1583663558450195, -7.0636644, -6.277608, -4.206898, -4.206898),
           (1583663558450195, -7.0636644, -6.277608, -4.206898, -4.206898),
           (1583663558450195, -7.0636644, -6.277608, -4.206898, -4.206898)],
          dtype=[('ts', '<u8'), ('x', '<f4'), ('y', '<f4'), ('z', '<f4'), ('temp', '<f4')])
    

    np.array(data) is a (10, 5) float array.

    In [388]: pd.DataFrame(_386)                                                                   
    Out[388]: 
                     ts         x         y         z      temp
    0  1583663558450195 -7.063664 -6.277608 -4.206898 -4.206898
    1  1583663558450195 -7.063664 -6.277608 -4.206898 -4.206898
    2  1583663558450195 -7.063664 -6.277608 -4.206898 -4.206898
     ...
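
To tie the pieces above together, here is a minimal sketch of one way the to_records route might be combined with h5py. It assumes a pandas version whose to_records accepts index=False and column_dtypes; the column names, the file name dtype_sketch.h5, and the in-memory driver are illustrative choices, not part of the original question.

```python
import h5py
import numpy as np
import pandas as pd

# Same values as in the question, repeated for 10 rows.
row = [1583663558450195, -7.063664436340332, -6.2776079177856445,
       -4.206898212432861, -4.206898212432861]
names = ["ts", "x", "y", "z", "temp"]  # named columns instead of 0..4
df = pd.DataFrame([row] * 10, columns=names)

# index=False drops the extra 'index' field;
# column_dtypes requests one dtype per column.
column_dtypes = {"ts": "u8", "x": "f4", "y": "f4", "z": "f4", "temp": "f4"}
records = df.to_records(index=False, column_dtypes=column_dtypes)

# A structured array is stored as a 1-D dataset with a compound type,
# one field per column. driver="core" with backing_store=False keeps
# this sketch in memory, so nothing is written to disk.
with h5py.File("dtype_sketch.h5", "w", driver="core", backing_store=False) as hf:
    ds = hf.create_dataset("Acceleration", data=records)
    stored_shape = ds.shape
    stored_dtype = ds.dtype
    first_ts = int(ds[0]["ts"])

print(stored_shape)
print(stored_dtype)
```

The structured array returned by rf.unstructured_to_structured can be passed to create_dataset in the same way. To write a real file, drop the driver and backing_store arguments.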