python hdf5 pytables

Error when trying to save hdf5 row where one column is a string and the other is an array of floats


I have two columns: one is a string and the other is a NumPy array of floats.

import numpy as np
import pandas as pd

a = 'this is string'
b = np.array([-2.355,  1.957,  1.266, -6.913])

I would like to store them in one row, as separate columns, in an HDF5 file. For that I am using pandas:

hdf_key = 'hdf_key'
store5 = pd.HDFStore('file.h5')

z = pd.DataFrame(
{
 'string': [a],
 'array': [b]
})
store5.append(hdf_key, z, index=False)
store5.close()

However, I get this error:

TypeError: Cannot serialize the column [array] because
its data contents are [mixed] object dtype

Is there a way to store this to h5? If so, how? If not, what's the best way to store this sort of data?


Solution

  • I can't help you with pandas, but I can show you how to do this with PyTables. Basically, you create a table from either a NumPy recarray or a dtype that defines the mixed data types.

    Below is a super simple example to show how to create a table with 1 string and 4 floats. Then it adds rows of data to the table. It shows 2 different methods to add data:
    1. A list of tuples (1 tuple for each row) - see append_list
    2. A numpy recarray (with dtype matching the table definition) - see simple_recarr in the for loop

    To get the rest of the arguments for create_table(), read the PyTables documentation. It's very helpful and should answer additional questions. Link below:
    PyTables User's Guide
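
Before the full example, here is a minimal NumPy-only sketch of the "dtype that defines the mixed data types" idea. It keeps the four floats together as one fixed-length sub-array field instead of four scalar fields. The field names 'string' and 'array' come from the question; the 16-byte string width and the fixed length of 4 are assumptions for illustration. PyTables tables can hold multidimensional columns, so a dtype like this should be usable as a table description, but check the User's Guide before relying on that.

```python
import numpy as np

a = 'this is string'
b = np.array([-2.355, 1.957, 1.266, -6.913])

# Assumed layout: a 16-byte string field plus a float64 sub-array of length 4.
row_dtype = np.dtype([('string', 'S16'), ('array', np.float64, (4,))])

# One-row structured array; each field is filled by name.
row = np.zeros(1, dtype=row_dtype)
row['string'] = a.encode('ascii')  # 'S' fields store bytes, not str
row['array'] = b                   # shape (4,) broadcasts into shape (1, 4)

print(row.dtype)
print(row['array'].shape)
```

A fixed-width 'S16' field silently truncates longer strings, so pick the width from the longest value you expect.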

    import tables as tb
    import numpy as np
    
    with tb.open_file('SO_55943319.h5', 'w') as h5f:
    
        my_dtype = np.dtype([('A','S16'),('b',float),('c',float),('d',float),('e',float)])
        dset = h5f.create_table(h5f.root, 'table_data', description=my_dtype)
    
        # Append one row using a list of tuples:
        append_list = [('test string', -2.355, 1.957, 1.266, -6.913)]
        dset.append(append_list)
    
        # Append five more rows, reusing a one-row recarray as a buffer:
        simple_recarr = np.recarray((1,), dtype=my_dtype)
    
        for i in range(5):
            simple_recarr['A'] = 'string_' + str(i)
            simple_recarr['b'] = 2.0 * i
            simple_recarr['c'] = 3.0 * i
            simple_recarr['d'] = 4.0 * i
            simple_recarr['e'] = 5.0 * i
    
            dset.append(simple_recarr)
    
    print('done')
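
To check that a row round-trips, the table can be read back into a NumPy structured array. The sketch below is self-contained: it writes one row with the same dtype as above and then reads it again. The file name readback_sketch.h5 and the temporary directory are placeholders for illustration, not part of the original answer.

```python
import os
import tempfile

import numpy as np
import tables as tb

# Placeholder path in a temporary directory (an assumption for this sketch).
path = os.path.join(tempfile.mkdtemp(), 'readback_sketch.h5')

my_dtype = np.dtype([('A', 'S16'), ('b', float), ('c', float),
                     ('d', float), ('e', float)])

# Write a single row, passing the string as bytes to match the 'S16' field.
with tb.open_file(path, 'w') as h5f:
    table = h5f.create_table(h5f.root, 'table_data', description=my_dtype)
    table.append([(b'test string', -2.355, 1.957, 1.266, -6.913)])

# Read the whole table back as a NumPy structured array.
with tb.open_file(path, 'r') as h5f:
    rows = h5f.root.table_data.read()

first_string = rows['A'][0].decode()  # stored as bytes, so decode to str
first_floats = np.array([rows[name][0] for name in ('b', 'c', 'd', 'e')])

print(first_string, first_floats)
```

Strings come back as bytes because the column was declared as 'S16', so decode them if you need str values.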