python, pandas, hdf5, hdfstore

Iteratively append pandas dataframes in a single group to h5 file


I have a small script meant to read CSV files from a user-supplied directory and convert them to a single HDF5 file:

import glob
from pathlib import Path

import pandas as pd

path = input('Insert the directory path:')

file_list = []
for file in glob.glob(path):
    file_list.append(file)


for filename in file_list:
    df = pd.read_csv(filename)
    key = Path(filename).resolve().stem
    with pd.HDFStore('test.h5') as store:
        store.append(key=key, value=df, format='table', data_columns=df.columns)

What this currently does is append each file (as a dataframe) as its own group. If I open the file in ViTables, every file shows up as a separate group directly under the root.

Also, if I run the script again using another directory, it will continue appending new groups (one for each file) to the root group.

What I would like is that every time I run the script, it appends the file groups inside a new group (subject) in the root, so that the root contains one group per run and each of those contains one group per file.

I feel like this probably has something to do with the keys I'm passing to store.append, because right now it's using the file name as the key. I was able to pass the keys manually and append the desired dataframe, but that is not the end goal I wanted.

Some advice would be great! Thank you


Solution

  • import glob
    import os
    import pandas as pd
    
    # inputs
    path = input('Insert the directory path:')
    group = input('Insert a group name: ')
    
    # create a list of file paths (path is used as a glob pattern, e.g. some_dir/*.csv)
    file_list = glob.glob(path)
    # dict comprehension to create keys from file name and values from the csv files
    dfs = {os.path.basename(os.path.normpath(filename)).split('.')[0]: pd.read_csv(filename) for filename in file_list}
    
    # open the HDF5 file once; the context manager closes it on exit
    with pd.HDFStore('test.h5') as store:
        # loop through the dataframes
        for k, df in dfs.items():
            # append df under the chosen group, building the key with an f-string
            store.append(f'{group}/{k}', df, format='table', data_columns=df.columns)
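A note on why the key matters: HDFStore treats the key as a path inside the file, so a key that contains a slash places the table in a nested group. Below is a minimal sketch of just the key construction, pulled out into a helper. The name `make_key` is hypothetical (it is not part of pandas), and it assumes you want the same naming rule as the answer, i.e. the file name up to its first dot.

```python
from pathlib import Path


def make_key(group, csv_path):
    """Build an HDFStore key of the form '<group>/<file name up to first dot>'."""
    # Hypothetical helper for illustration; not a pandas function.
    group = group.strip("/")
    if not group:
        raise ValueError("group name must not be empty")
    # Same rule as the answer: keep the file name up to its first dot.
    name = Path(csv_path).name.split(".")[0]
    return f"{group}/{name}"


print(make_key("sample", "data/untitled.csv"))    # sample/untitled
print(make_key("sample1", "data/untitled2.csv"))  # sample1/untitled2
```

Passing the returned string as the first argument to `store.append` should put each dataframe under the chosen group instead of directly under the root.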
    

    I ran the above code twice, once for the group sample and once for the group sample1. Below are the results:

    import h5py
    # open the file read-only and list the keys under each group
    with h5py.File('test.h5', 'r') as f:
        print(f['sample'].keys())
        print(f['sample1'].keys())
    
    <KeysViewHDF5 ['untitled', 'untitled1']>
    <KeysViewHDF5 ['untitled2', 'untitled3']>
    

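If you would rather check the result with pandas instead of h5py, something along these lines should work. This is only a sketch: `list_group_keys` and `read_group` are hypothetical helper names, and it assumes the file was written with `format='table'` under group names like the ones above and that PyTables is installed.

```python
import pandas as pd


def list_group_keys(h5_path, group):
    """Return the sorted keys of all tables stored under /<group>."""
    # HDFStore.keys() returns absolute keys such as '/sample/untitled'.
    prefix = f"/{group.strip('/')}/"
    with pd.HDFStore(h5_path, mode="r") as store:
        return sorted(key for key in store.keys() if key.startswith(prefix))


def read_group(h5_path, group):
    """Load every table under /<group> into a dict keyed by the table name."""
    return {
        key.rsplit("/", 1)[-1]: pd.read_hdf(h5_path, key=key)
        for key in list_group_keys(h5_path, group)
    }
```

For example, `read_group('test.h5', 'sample')` would give you a dict with one dataframe per CSV file that was appended under the sample group.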