Tags: python, pandas, hdf5, h5py, pytables

Why does the HDF5 file size increase dramatically when I segment the data into 30 smaller dataframes with 30 different keys?


I have a series of HDF5 files holding large pandas dataframes. A typical file has around 1,000,000 rows. I use complib='blosc', complevel=9 for compression. The original HDF5 files are each saved as 1 flat dataframe under a single key.

I then tried segmenting the dataframes logically into 30 smaller dataframes and saving them in the same HDF5 file with 30 different keys and the same compression.

The shocking issue is that the files with 30 smaller dataframes are 40x larger than the flat file.

The flat HDF5 file is saved as follows:

dfx.to_hdf(file_name, key='opp',mode='a',complib='blosc',complevel=9, append=True)

The segmented HDF5 file is saved as follows:

for i in range(30):
    dfx_small[i].to_hdf(file_name, key='d'+str(i), mode='a', complib='blosc', complevel=9)

Am I doing something wrong or is this size increase expected?

Additional Observation

I compared all the HDF5 files generated as (1) a flat dataframe vs. (2) 30 chunked dataframes. The largest files grow 8x to 10x in size when the dataframe is saved as 30 smaller dataframes, while the smaller files grow 100x to 1000x. I then experimented with saving the 30-chunk HDF5 files with and without compression. It seems that when multiple dataframes with unique keys are placed in the same HDF5 file, compression is almost non-functional. I've tried all the compression options with similar results.
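
For reference, I compared the on-disk sizes with something like this (the file names here are placeholders for my actual files):

    import os

    # Compare the flat file against the 30-key file
    for f in ['flat.h5', 'segmented_30_keys.h5']:
        print(f, round(os.path.getsize(f) / 1e6, 2), 'MB')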

There is a bug with compression when saving HDF5 files with multiple datasets.

I have opened Issue #45286.


Solution

  • I created some simple tests, and discovered some interesting behavior.

    1. First, I created some data to mimic your description and saw an 11x increase in file size going from 1 DF to 30 DFs. So, clearly something's going on... (You will have to provide some code that replicates the 40x increase.)
    2. Next, using the same dataframes as above, I created 2 uncompressed files -- that is, I omitted the compression parameters complib='blosc', complevel=9. As expected, the uncompressed files were larger, but the increase from 1 DF to 30 DFs was much smaller (only a 65% increase).

    Pandas Results

    # of DFs   Compression   Size (MB)
    1          Blosc-9       3.1
    30         Blosc-9       33.5
    1          No            24.8
    30         No            54.8

    So, some of this behavior could be due to compression differences with multiple, smaller dataframes. (Likely compression isn't as effective.) Also, Pandas uses an "interesting" schema to store dataframes. If you inspect the files with HDF View, the data does not look like a table (like you would see if you used PyTables or h5py to save the data). There may be overhead that is duplicated when you have multiple dataframes (but that is just an inference).
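
    Rather than HDF View, you can also walk the file programmatically to see this layout. A minimal sketch with h5py (using the 30-key test file created in the code further below):

    import h5py

    def show(name, obj):
        # Print every group and dataset that pandas created under each key
        if isinstance(obj, h5py.Dataset):
            print(f'{name}: shape={obj.shape}, dtype={obj.dtype}')
        else:
            print(f'{name} (group)')

    with h5py.File('pd_file_30_blosc.h5', 'r') as h5f:
        h5f.visititems(show)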

    Updated 2021-01-08:
    You can also save to HDF5 format using PyTables or h5py. (Pandas is built on top of PyTables.) Out of curiosity, I extended the example above to write HDF5 files with each package. One file has a single dataset and the other has 30 datasets (to mimic the dataframes method). I also compared the effect of compression. The results provide some interesting insights.

    PyTables Results

    # of DSs   Compression   Size (MB)
    1          Blosc-9       0.14
    30         Blosc-9       0.40
    1          Zlib-9        0.08
    30         Zlib-9        0.17
    1          No            11.9
    30         No            13.5

    Pandas vs PyTables Observations

    1. In all cases, the file written with PyTables is much smaller than the corresponding file from Pandas.
    2. For files written with PyTables, the number of datasets has a smaller effect on the file size.
    3. The effect of compression going from 1->30 datasets with PyTables is less significant than Pandas (with either compression library).
    4. Compression is much more effective with PyTables vs Pandas (as a percentage reduction).
    5. Conclusion: PyTables compression and HDF5 technology work well and are not the source of your problem. (You can verify the filters pandas applied with the sketch below.)
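
    If you want to confirm what compression pandas actually applied, you can ask PyTables for the filters on every node it wrote. A quick sketch (again using the 30-key test file from the code further below):

    import tables as tb

    with tb.open_file('pd_file_30_blosc.h5', 'r') as h5f:
        for node in h5f.walk_nodes('/', classname='Leaf'):
            # Each leaf (array/table) reports its own Filters settings
            print(node._v_pathname, node.filters)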

    h5py Results

    # of DSs   Compression   Size (MB)
    1          Gzip-9        0.15
    30         Gzip-9        0.40
    1          No            11.9
    30         No            11.9

    Pandas vs h5py Observations

    1. In 3 out of 4 cases, the file written with h5py is much smaller than the corresponding file from Pandas. The only Pandas file that is smaller is the compressed file with 1 DF.
    2. For files written with h5py, the number of datasets does not affect file size.
    3. The effect of compression going from 1->30 datasets with h5py is insignificant (using the gzip library).
    4. Compression is much more effective with h5py than with Pandas (as a percentage reduction).

    PyTables vs h5py Observations

    1. In most cases, the file written with PyTables is very similar in size to the corresponding file from h5py. The only h5py file that is smaller is the uncompressed file with 30 datasets.
    2. For uncompressed files written with h5py, the number of datasets does not affect file size.
    3. For uncompressed files written with PyTables, increasing the number of datasets increases the file size (by 14% with this test data).

    Create test data (as lists):

    import string
    
    # Base values: 52 letters, 52 ints, 52 floats
    col1 = list(string.ascii_letters)
    col2 = [x for x in range(1, 53)]
    col3 = [float(x) for x in range(1, 53)]

    ncopies = 18_000   # replicate the base values to get 936,000 rows
    nslices = 30       # number of smaller dataframes/datasets
    fact = len(col1)*ncopies//nslices   # rows per slice

    col_c = []; col_int = []; col_float = []
    for i in range(ncopies):
        col_c.extend(col1)
        col_int.extend(col2)
        col_float.extend(col3)

    Write test data with Pandas:

    import pandas as pd
    
    dfx = pd.DataFrame({'col_c': col_c, 'col_int': col_int, 'col_float': col_float})

    # Single-key files: one compressed with Blosc-9, one uncompressed
    dfx.to_hdf('pd_file_1_blosc.h5', key='test_data', mode='a', complib='blosc', complevel=9)
    dfx.to_hdf('pd_file_1_uc.h5', key='test_data', mode='a')

    # 30-key files: one slice of the dataframe per key
    for i in range(nslices):
        dfx_small = dfx[i*fact:(i+1)*fact]
        dfx_small.to_hdf('pd_file_30_blosc.h5', key=f'd{i:02}', mode='a', complib='blosc', complevel=9)
        dfx_small.to_hdf('pd_file_30_uc.h5', key=f'd{i:02}', mode='a')
    

    Create NumPy recarray for PyTables and h5py:

    import numpy as np
    
    arr_dt = np.dtype([ ('col_c', 'S1'), ('col_int', int), ('col_float', float) ]) 
    recarr = np.empty(shape=(len(col_c),), dtype=arr_dt)
    recarr['col_c'] = col_c
    recarr['col_int'] = col_int
    recarr['col_float'] = col_float   
    

    Write test data with PyTables:

    import tables as tb
    
    f_blosc = tb.Filters(complib ="blosc", complevel=9)
    f_zlib  = tb.Filters(complib ="zlib", complevel=9)
    
    with tb.File('tb_file_1_blosc.h5','w') as h5f:
        h5f.create_table('/','test_data', obj=recarr, filters=f_blosc)
       
    with tb.File('tb_file_1_zlib.h5','w') as h5f:
        h5f.create_table('/','test_data', obj=recarr, filters=f_zlib)
        
    with tb.File('tb_file_1_uc.h5','w') as h5f:
        h5f.create_table('/','test_data', obj=recarr)
        
    with tb.File('tb_file_30_blosc.h5','w') as h5f:
        for i in range(nslices):
            h5f.create_table('/',f'test_data_{i:02}', obj=recarr[i*fact:(i+1)*fact],
                               filters=f_blosc)
    
    with tb.File('tb_file_30_zlib.h5','w') as h5f:
        for i in range(nslices):
            h5f.create_table('/',f'test_data_{i:02}', obj=recarr[i*fact:(i+1)*fact],
                               filters=f_zlib)
            
    with tb.File('tb_file_30_uc.h5','w') as h5f:
        for i in range(nslices):
            h5f.create_table('/',f'test_data_{i:02}', obj=recarr[i*fact:(i+1)*fact])
    

    Write test data with h5py:

    import h5py
    
    with h5py.File('h5py_file_1_gzip.h5','w') as h5f:
        h5f.create_dataset('test_data', data=recarr, compression="gzip", compression_opts=9)
       
    with h5py.File('h5py_file_1_uc.h5','w') as h5f:
        h5f.create_dataset('test_data', data=recarr)
        
    with h5py.File('h5py_file_30_gzip.h5','w') as h5f:
        for i in range(nslices):
            h5f.create_dataset(f'test_data_{i:02}', data=recarr[i*fact:(i+1)*fact],
                               compression="gzip", compression_opts=9)
            
    with h5py.File('h5py_file_30_uc.h5','w') as h5f:
        for i in range(nslices):
            h5f.create_dataset(f'test_data_{i:02}', data=recarr[i*fact:(i+1)*fact])
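
    Finally, a simple way to tabulate the resulting file sizes (the exact numbers above may vary slightly with library versions and platform):

    import glob
    import os

    # All test files created above share the '_file_' naming pattern
    for fname in sorted(glob.glob('*_file_*.h5')):
        print(f'{fname}: {os.path.getsize(fname)/1e6:.2f} MB')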