Tags: pandas, hdf5, dask, pytables, hdfstore

How do I combine multiple pandas dataframes into an HDF5 object under one key/group?


I am parsing data from a large csv file, roughly 800 GB in size, and saving each line as a one-row pandas dataframe.

import csv
import pandas as pd

readcsvfile = csv.reader(csvfile)
for i, line in enumerate(readcsvfile):
    # parse the line into a dictionary of csv field -> value, "dictionary_line"
    # save it as a one-row pandas dataframe
    df = pd.DataFrame(dictionary_line, index=[i])

Now, I would like to save this into HDF5 format and query the h5 file as if it were the entire csv file.

My approach so far has been:

import csv
import pandas as pd

store = pd.HDFStore("pathname/file.h5")

hdf5_key = "single_key"

csv_columns = ["COL1", "COL2", "COL3", "COL4",..., "COL55"]
readcsvfile = csv.reader(csvfile)
for i, line in enumerate(readcsvfile):
    # parse the line into a dictionary of csv field -> value, "dictionary_line"
    # save it as a one-row pandas dataframe
    df = pd.DataFrame(dictionary_line, index=[i])
    store.append(hdf5_key, df, data_columns=csv_columns, index=False)

That is, I try to append each dataframe df to the HDF5 store under a single key. However, this fails with:

  Attribute 'superblocksize' does not exist in node: '/hdf5_key/_i_table/index'

So I could try to save everything into one pandas dataframe first, i.e.:

import csv
import pandas as pd

store = pd.HDFStore("pathname/file.h5")

hdf5_key = "single_key"

csv_columns = ["COL1", "COL2", "COL3", "COL4",..., "COL55"]
readcsvfile = csv.reader(csvfile)
total_df = pd.DataFrame()
for i, line in enumerate(readcsvfile):
    # parse the line into a dictionary of csv field -> value, "dictionary_line"
    # save it as a one-row pandas dataframe
    df = pd.DataFrame(dictionary_line, index=[i])
    total_df = pd.concat([total_df, df])   # grows one big dataframe

and then store the whole thing in HDF5 format:

    store.append(hdf5_key, total_df, data_columns=csv_columns, index=False)

However, I don't think I have the RAM to hold all the csv lines in total_df before writing it out to HDF5.

So, how do I append each "single-line" df to the HDF5 store so that it ends up as one big dataframe (like the original csv)?
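
A middle ground between the two approaches above would be to buffer the parsed rows in fixed-size batches and append one multi-row dataframe per batch, so RAM use stays bounded while the store still ends up holding a single table. Here is a minimal sketch; the batch size and the dict(zip(...)) stand-in for the real parsing are assumptions:

import csv
import pandas as pd

store = pd.HDFStore("pathname/file.h5")
hdf5_key = "single_key"
csv_columns = ["COL%d" % n for n in range(1, 56)]
BATCH_SIZE = 10000   # assumed; tune to available RAM

def flush(buf, end_row):
    # keep the original global row numbers as the dataframe index
    df = pd.DataFrame(buf, index=range(end_row - len(buf) + 1, end_row + 1))
    # string columns may also need min_itemsize=... to reserve column width
    store.append(hdf5_key, df, data_columns=csv_columns, index=False)

buffer = []
with open("pathname/file.csv") as csvfile:   # hypothetical path
    for i, line in enumerate(csv.reader(csvfile)):
        dictionary_line = dict(zip(csv_columns, line))   # stand-in for the real parsing
        buffer.append(dictionary_line)
        if len(buffer) == BATCH_SIZE:
            flush(buffer, i)
            buffer = []
    if buffer:   # write the final partial batch
        flush(buffer, i)
store.close()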

EDIT: Here's a concrete example of a csv file with different data types:

 order    start    end    value    
 1        1342    1357    category1
 1        1459    1489    category7
 1        1572    1601    category23
 1        1587    1599    category2
 1        1591    1639    category1
 ....
 15        792     813    category13
 15        892     913    category5
 ....
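
For a file like this, a sketch of an alternative that sidesteps the hand-rolled csv.reader loop entirely: let pandas stream the file in chunks. The chunksize and the 32-character width reserved for the value strings are assumptions; min_itemsize matters because PyTables fixes a string column's width on the first append.

import pandas as pd

store = pd.HDFStore("pathname/file.h5", "w")
hdf5_key = "single_key"

# read_csv streams the file chunk by chunk, so RAM use stays bounded
for chunk in pd.read_csv("pathname/file.csv", chunksize=100000):
    store.append(hdf5_key, chunk,
                 data_columns=["order", "start", "end", "value"],
                 min_itemsize={"value": 32},   # reserve room for longer category strings
                 index=False)
store.close()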

Solution

  • Your code should work. Can you try the following code?

    import pandas as pd
    import numpy as np

    store = pd.HDFStore("file.h5", "w")
    hdf5_key = "single_key"
    csv_columns = ["COL%d" % i for i in range(1, 56)]
    for i in range(10):
        df = pd.DataFrame(np.random.randn(1, len(csv_columns)), columns=csv_columns)
        store.append(hdf5_key, df, data_columns=csv_columns, index=False)
    store.close()

    If this code works, then there is something wrong with your data. Note also that the store above is opened with mode "w", which recreates file.h5 from scratch; if an earlier failed run left a partial file.h5 behind, that alone could explain the superblocksize error.
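
    Since every append uses index=False (fast appends, no PyTables index), you can build the index once at the end and then query the store as if it were the whole csv. A sketch; the where clause is purely illustrative:

    store = pd.HDFStore("file.h5")

    # build the table index once after all appends - cheaper than
    # re-indexing on every append
    store.create_table_index(hdf5_key, optlevel=9, kind="full")

    # where clauses work on any column declared in data_columns
    subset = store.select(hdf5_key, where="COL1 > 0")

    store.close()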