python hdf5 pytables h5py

Converting a dataset into an HDF5 dataset


I have a dataset that I would like to convert to an HDF5 format. It is a dataset from NOAA. The directory structure is something like:

NOAA
├── code
├── ghcnd_all
├── ghcnd_all.tar.gz
├── ghcnd-stations.txt
├── ghcnd-version.txt
├── readme.txt
└── status.txt

I am working with pandas for data analysis. My main reason for converting is to save space: the dataset is ~25 GB.

How can I convert this dataset into a single .hdf5 file?


Solution

  • Data in HDF5 is stored in datasets: homogeneous, possibly multidimensional arrays with up to 32 dimensions. The length of each dimension is an unsigned 64-bit integer, and elements can be datatypes of arbitrary size, including compound datatypes, which puts the upper limit for a single dataset above 16 exabytes. Datasets are meant to hold structured data such as numpy arrays, pandas DataFrames, images and spreadsheets. I have not found any way to put a plain text or tar.gz file directly into HDF5. However, using Python you could read a file into a string and put that into a dataset, as shown at Strings in HDF5.

    In addition to datasets, groups are the other major object type in HDF5; they are containers for datasets and other groups. Datasets and groups are analogous to files and directories (or folders) and provide the basis for a hierarchical format, like a Unix filesystem, in which objects are accessed with pathnames beginning with /. An HDF5 file is a container for possibly multiple datasets and groups and has no size limit.
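    To make the string approach and the group/dataset hierarchy concrete, here is a minimal sketch using h5py. It is an illustration under assumptions, not a recommended layout: the helper names store_text_file and load_text, the /text group and the sample file contents are all made up for this example, and h5py 2.10 or later is assumed for h5py.string_dtype.

```python
import os
import tempfile

import h5py


def store_text_file(h5_path, text_path, dataset_name):
    """Read a whole text file and store it as one string dataset (hypothetical helper)."""
    with open(text_path, "r", encoding="utf-8") as f:
        text = f.read()
    with h5py.File(h5_path, "a") as h5:
        # The group "/text" plays the role of a directory; create it if missing.
        group = h5.require_group("text")
        # A scalar dataset holding a variable-length UTF-8 string.
        group.create_dataset(
            dataset_name,
            data=text,
            dtype=h5py.string_dtype(encoding="utf-8"),
        )


def load_text(h5_path, dataset_name):
    """Read the string back by its pathname inside the file (hypothetical helper)."""
    with h5py.File(h5_path, "r") as h5:
        value = h5["/text/" + dataset_name][()]
    # Depending on the h5py version the value comes back as bytes or str.
    return value.decode("utf-8") if isinstance(value, bytes) else value


# Demo with a made-up file; the contents are placeholders, not NOAA data.
workdir = tempfile.mkdtemp()
text_path = os.path.join(workdir, "readme.txt")
h5_path = os.path.join(workdir, "text_demo.h5")
sample_text = "Hypothetical readme contents.\nSecond line.\n"

with open(text_path, "w", encoding="utf-8") as f:
    f.write(sample_text)

store_text_file(h5_path, text_path, "readme.txt")
restored = load_text(h5_path, "readme.txt")
print(restored == sample_text)
```

    Storing a file this way keeps it inside the HDF5 container, but the text stays an opaque string. For analysis it is usually more useful to parse the data into tables first.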

    To get a better idea of what HDF5 is, I suggest downloading it and the accompanying utilities plus HDFView from HDF5 Downloads, installing it all and then going through Learning HDF5 with HDFView, which takes less than 30 minutes. HDFView is a Java GUI that makes it easy to interact with HDF5. You cannot simply drag and drop files into it, but file data can be imported into a dataset.

    It is very easy to create HDF5 files and add DataFrames to them with pandas, and that's a good method for putting data into an HDF5 file. Below is a demonstration of that.

    For more information about HDF5 you might take a look at the other tutorials listed on HDF5 Tutorials, HDF5 Python Examples by API, Additional HDF5 Python Examples and the Python h5py package documentation at HDF5 for Python. For more information about pandas, 10 Minutes to pandas is a good place to start, followed by the pandas Cookbook for a series of code examples and Python for Data Analysis by Wes McKinney, which is the best tutorial on pandas overall, since he invented and developed most of it and is an excellent author.

    Here is an example of using pandas to create an HDF5 file, load a DataFrame into it, then retrieve it and store a copy in another variable:

    In [193]: import pandas as pd
    
    In [194]: frame = pd.read_csv('test.csv')
    
    In [195]: frame
    Out[195]: 
       a   b   c   d message
    0  1   2   3   4     one
    1  5   6   7   8     two
    2  9  10  11  12   three
    
    In [196]: type(frame)
    Out[196]: pandas.core.frame.DataFrame
    
    In [197]: hdf5store = pd.HDFStore('mydata.h5')
    
    In [198]: %ls mydata.h5
     Volume in drive C is OS
     Volume Serial Number is 5B75-665D
    
     Directory of C:\Users\tn\Documents\python\pydata
    
    09/02/2015  12:41 PM                 0 mydata.h5
                   1 File(s)              0 bytes
                   0 Dir(s)  300,651,331,584 bytes free
    
    In [199]: hdf5store['frame'] = frame
    
    In [200]: hdf5store
    Out[200]: 
    <class 'pandas.io.pytables.HDFStore'>
    File path: mydata.h5
    /frame            frame        (shape->[3,5])
    
    In [201]: list(hdf5store.items())
    Out[201]: 
    [('/frame', /frame (Group) ''
        children := ['block0_values' (Array), 'block0_items' (Array), 'axis1' (Array), 'block1_items' (Array), 'axis0' (Array), 'block1_values' (VLArray)])]
    
    In [202]: hdf5store.close()
    

    Now demonstrate that frame can be retrieved from mydata.h5:

    In [203]: hdf5store2 = pd.HDFStore('mydata.h5')
    
    In [204]: list(hdf5store2.items())
    Out[204]: 
    [('/frame', /frame (Group) ''
        children := ['block0_values' (Array), 'block0_items' (Array), 'axis1' (Array), 'block1_items' (Array), 'axis0' (Array), 'block1_values' (VLArray)])]
    
    In [205]: framecopy = hdf5store2['frame']
    
    In [206]: framecopy
    Out[206]: 
       a   b   c   d message
    0  1   2   3   4     one
    1  5   6   7   8     two
    2  9  10  11  12   three
    
    In [207]: framecopy == frame
    Out[207]: 
          a     b     c     d message
    0  True  True  True  True    True
    1  True  True  True  True    True
    2  True  True  True  True    True
    
    In [208]: hdf5store2.close()
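
    Since the question is about saving space and ending up with a single .hdf5 file, the same HDFStore approach can be extended with compression and one key per table. The following is a rough sketch under assumptions, not a converter for the real NOAA files: the frames dictionary, its column names and values, and the stations/... key layout are all invented placeholders standing in for tables you would parse from the raw data yourself.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-ins for per-station tables parsed from the raw files.
frames = {
    "station_a": pd.DataFrame({"year": [2014, 2015], "tmax": [31.5, 30.1]}),
    "station_b": pd.DataFrame({"year": [2014, 2015], "tmax": [12.0, 13.4]}),
}

store_path = os.path.join(tempfile.mkdtemp(), "noaa_demo.h5")

# complevel and complib switch on compression for everything written to the store.
with pd.HDFStore(store_path, mode="w", complevel=9, complib="zlib") as store:
    for name, df in frames.items():
        # One key per table; keys behave like pathnames inside the file.
        store.put("stations/" + name, df, format="table")

# Reopen read-only, list what is inside and read one table back.
with pd.HDFStore(store_path, mode="r") as store:
    keys = sorted(store.keys())
    restored_frame = store["stations/station_a"]

print(keys)
print(restored_frame)
```

    How much space compression saves depends heavily on the data, so it is worth comparing file sizes on a small sample before converting all ~25 GB.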