
Writing data frame with object dtype to HDF5 only works after converting to string


I have a large DataFrame that I want to write to disk for quick retrieval. I believe to_hdf(...) infers the data type of the columns and sometimes gets it wrong. I wonder what the correct way to cope with this is.

import pandas as pd
import numpy as np

length = 10
df = pd.DataFrame({"a": np.random.randint(1e7, 1e8, length)})

# df.loc[1, "a"] = "abc"
# df["a"] = df["a"].astype(str)
print(df.dtypes)
df.to_hdf("df.hdf5", key="data", format="table")

Uncommenting various lines leads me to the following.

  • Just filling the column with numbers leads to the data type int32, and the frame stores without problems.
  • Setting one element to abc changes the dtype to object, but to_hdf seems to internally infer another data type and throws an error: TypeError: object of type 'int' has no len()
  • Explicitly converting the column to str succeeds, and to_hdf stores the data.

Now I am wondering what is happening in the second case, and whether there is a way to prevent it. The only way I found was to go through all columns, check whether they are dtype('O'), and explicitly convert them to str.
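For reference, a minimal sketch of that column-by-column workaround (assuming every object-dtype column should become a string column):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"a": np.random.randint(1e7, 1e8, 10)})
df.loc[1, "a"] = "abc"  # mixed int/str values -> column dtype becomes object

# Cast every object-dtype column to str so PyTables can determine
# a string length for the column instead of choking on the ints
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].astype(str)

# df.to_hdf("df.hdf5", key="data", format="table")  # now succeeds
```

The dtype of the column stays object after the cast, but every value is now a string, which is what the table format can serialize.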


Solution

  • Instead of using HDF5, I have found a generic pickling library that seems to be perfect for the job: joblib

    Storing and loading data is straightforward:

    import joblib
    joblib.dump(df, "file.jl")
    df2 = joblib.load("file.jl")