Search code examples
pythoncsvpandashdf5pytables

Convert large csv to hdf5


I have a 100M line csv file (actually many separate csv files) totaling 84GB. I need to convert it to a HDF5 file with a single float dataset. I used h5py in testing without any problems, but now I can't do the final dataset without running out of memory.

How can I write to HDF5 without having to store the whole dataset in memory? I'm expecting actual code here, because it should be quite simple.

I was just looking into pytables, but it doesn't look like the array class (which corresponds to a HDF5 dataset) can be written to iteratively. Similarly, pandas has read_csv and to_hdf methods in its io_tools, but I can't load the whole dataset at one time so that won't work. Perhaps you can help me solve the problem correctly with other tools in pytables or pandas.


Solution

  • Use append=True in the call to to_hdf:

    import numpy as np
    import pandas as pd
    
    filename = '/tmp/test.h5'
    
    df = pd.DataFrame(np.arange(10).reshape((5,2)), columns=['A', 'B'])
    print(df)
    #    A  B
    # 0  0  1
    # 1  2  3
    # 2  4  5
    # 3  6  7
    # 4  8  9
    
    # Save to HDF5
    df.to_hdf(filename, 'data', mode='w', format='table')
    del df    # allow df to be garbage collected
    
    # Append more data
    df2 = pd.DataFrame(np.arange(10).reshape((5,2))*10, columns=['A', 'B'])
    df2.to_hdf(filename, 'data', append=True)
    
    print(pd.read_hdf(filename, 'data'))
    

    yields

        A   B
    0   0   1
    1   2   3
    2   4   5
    3   6   7
    4   8   9
    0   0  10
    1  20  30
    2  40  50
    3  60  70
    4  80  90
    

    Note that you need to use format='table' in the first call to df.to_hdf to make the table appendable. Otherwise, the format is 'fixed' by default, which is faster for reading and writing, but creates a table which can not be appended to.

    Thus, you can process each CSV one at a time, use append=True to build the hdf5 file. Then overwrite the DataFrame or use del df to allow the old DataFrame to be garbage collected.


    Alternatively, instead of calling df.to_hdf, you could append to a HDFStore:

    import numpy as np
    import pandas as pd
    
    filename = '/tmp/test.h5'
    store = pd.HDFStore(filename)
    
    for i in range(2):
        df = pd.DataFrame(np.arange(10).reshape((5,2)) * 10**i, columns=['A', 'B'])
        store.append('data', df)
    
    store.close()
    
    store = pd.HDFStore(filename)
    data = store['data']
    print(data)
    store.close()
    

    yields

        A   B
    0   0   1
    1   2   3
    2   4   5
    3   6   7
    4   8   9
    0   0  10
    1  20  30
    2  40  50
    3  60  70
    4  80  90