Question: Read an H5 file from a folder inside a zipped folder into a pandas DataFrame
Background : The directory structure I have looks like this:
file.zip/2019/file.h5
file.zip is the zipped folder
2019 is the folder inside the zipped folder
I can extract the folder using extractall and read the h5 file from the extracted folder. However, I am looking to read it directly from the zipped folder into a pandas DataFrame.
Code to create a sample file:
Here is the code to recreate a sample h5 file that I am trying to use in this scenario:
Step 1:
import h5py
# Create an HDF5 file containing one 4x6 dataset of big-endian 32-bit integers
file = h5py.File('sample.h5', 'w')
dataset = file.create_dataset("dset", (4, 6), h5py.h5t.STD_I32BE)
file.close()
Step 2:
After the file is created, put it in a folder "2019". Place "2019" inside another folder called "zipfolder" and zip it. The resulting archive is "zipfolder.zip", and the file sits at "zipfolder/2019/sample.h5" inside it (the sample's counterpart of "file.zip/2019/file.h5").
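Step 2 can also be done from Python instead of by hand. The following is a minimal sketch using only the standard library; the helper name zip_folder is hypothetical, and it assumes the folder to be zipped already exists on disk with sample.h5 placed under "zipfolder/2019".

```python
import zipfile
from pathlib import Path


def zip_folder(src_dir, zip_path):
    """Zip src_dir so that member names start with the folder's own name.

    zip_folder is a hypothetical helper written for this example.
    """
    src_dir = Path(src_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(src_dir.rglob("*")):
            if path.is_file():
                # Store names relative to the parent directory, so the archive
                # holds e.g. "zipfolder/2019/sample.h5" and no absolute paths.
                zf.write(path, path.relative_to(src_dir.parent).as_posix())
    return zip_path
```

Calling zip_folder("zipfolder", "zipfolder.zip") after moving sample.h5 into "zipfolder/2019" should produce an archive whose namelist() contains "zipfolder/2019/sample.h5".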
Note: This is an h5py (plain HDF5) file and not an HDFStore. Pandas read_hdf cannot work on such H5 files. Read the HDF5 documentation for more clarity on H5 files and HDFStore. The two have different internal structures but share the same .h5 extension. For reading H5 files, the h5py package is used.
I figured this out with the help of the h5py Google group: https://groups.google.com/forum/m/#!forum/h5py
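import zipfile
import h5py
import pandas as pd

print(h5py.__version__)  # Make sure the version is 2.9 or above (needed to open file-like objects)
zf = zipfile.ZipFile('zipfolder.zip')
print(zf.namelist())  # Get the name of the file inside the archive
fiz = zf.open('zipfolder/2019/sample.h5')  # File-like object for the zipped h5 file
hf = h5py.File(fiz, 'r')
print(list(hf.keys()))  # To see the datasets inside the h5 file
df = pd.DataFrame(hf['dset'][:])  # [:] reads the whole dataset into a NumPy array
# Close the handles once the data has been copied into the DataFrame
hf.close()
fiz.close()
zf.close()
df.head()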
I used h5py to read the H5 file. As of now, pandas reads only HDFStore files, which store data in a structured DataFrame layout, and does not read plain H5 files.
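A variation on the same idea, in case it is useful: the zipped member can be copied into an in-memory buffer first, so h5py works on an ordinary seekable object and the zip archive can be closed right away. This is only a sketch; the function name read_h5_dataset_from_zip and its parameters are made up for this example, and it assumes h5py 2.9 or above and a file small enough to fit in memory.

```python
import io
import zipfile

import h5py
import pandas as pd


def read_h5_dataset_from_zip(zip_path, member, dataset):
    """Return one dataset of an H5 file stored inside a zip archive as a DataFrame.

    Hypothetical helper: the name and parameters are assumptions for this example.
    """
    with zipfile.ZipFile(zip_path) as zf:
        # Copy the whole member into memory so h5py gets a seekable file object.
        buffer = io.BytesIO(zf.read(member))
    with h5py.File(buffer, "r") as hf:
        # [:] copies the data into a NumPy array before the file is closed.
        return pd.DataFrame(hf[dataset][:])
```

With the sample archive from the question, this would be called as read_h5_dataset_from_zip('zipfolder.zip', 'zipfolder/2019/sample.h5', 'dset'). The trade-off is memory: the entire H5 file is decompressed into RAM, whereas passing zf.open(...) directly lets h5py read from the archive on demand.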