Tags: python, numpy, hdf5, h5py

How to store my own class object into hdf5?


I created a class to hold experiment results from my research (I'm an EE PhD student), like this:

class Trial:
    def __init__(self, subID, triID):
        self.filePath = '' # file path of the folder
        self.subID = -1    # int
        self.triID = -1    # int
        self.data_A = -1   # numpy array
        self.data_B = -1   # numpy array
        ......

It's a mix of many bools, ints, and NumPy arrays. You get the idea. I read that data loads faster when it is stored in HDF5 format. Can I do that with my data, which is a Python list of my Trial objects?

Note that there is a similar question on Stack Overflow, but it has only one answer, which doesn't answer the question. Instead, it breaks the OP's custom class down into basic data types and stores them in individual datasets. I'm not against doing that, but I want to know whether it's the only way, because it goes against the object-oriented philosophy.


Solution

  • Here's a small class that I use for saving data like this. It pickles the data rather than writing HDF5. You can use it like this:

    dc = DataContainer()
    dc.trials = <your list of trial objects here>
    dc.save('mydata.pkl')
    

    Then, to load it:

    dc = DataContainer.load('mydata.pkl')
    

    Here's the DataContainer file:

    import gzip
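    try:
        import cPickle as pickle  # Python 2: C-accelerated pickle
    except ImportError:
        import pickle             # Python 3: cPickle was merged into pickle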
    
    # Simple container with load and save methods.  Declare the container
    # then add data to it.  Save will save any data added to the container.
    # The class automatically gzips the file if it ends in .gz
    #
    # Notes on size and speed (using UbuntuDialog data)
    #       pkl     pkl.gz
    # Save  11.4s   83.7s
    # Load   4.8s   45.0s
    # Size  596M    205M
    #
    class DataContainer(object):
        @staticmethod
        def isGZIP(filename):
            return filename.endswith('.gz')
    
        # Using HIGHEST_PROTOCOL is almost 2X faster and creates a file that
        # is ~10% smaller.  Load times go down by a factor of about 3X.
        def save(self, filename='DataContainer.pkl'):
            # Pick the opener once; the with-block closes the file even if dump fails.
            opener = gzip.open if self.isGZIP(filename) else open
            with opener(filename, 'wb') as f:
                pickle.dump(self, f, protocol=pickle.HIGHEST_PROTOCOL)
    
        # Note that reading the file into a string and using pickle.loads is
        # about 10% faster but probably consumes a lot more memory, so we'll
        # skip that for now.
        @classmethod
        def load(cls, filename='DataContainer.pkl'):
            # Pick the opener once; the with-block closes the file even if load fails.
            opener = gzip.open if cls.isGZIP(filename) else open
            with opener(filename, 'rb') as f:
                return pickle.load(f)
    

    Depending on your use case you could use this as described at the top, as a base class, or simply copy the pickle.dump line into your code.
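
As a rough sketch of the last option (calling pickle.dump directly on the list of trials), something like the following should work on Python 3. The Trial class here is a cut-down, hypothetical stand-in for the one in the question, and save_trials / load_trials are names made up for this example. Plain lists stand in for the NumPy arrays so the snippet has no third-party dependencies; NumPy arrays can be pickled the same way.

```python
import gzip
import os
import pickle
import tempfile


class Trial(object):
    """Hypothetical, cut-down stand-in for the question's Trial class."""

    def __init__(self, subID, triID, data_A):
        self.subID = subID      # int
        self.triID = triID      # int
        self.data_A = data_A    # a list here; a numpy array pickles the same way


def save_trials(trials, filename):
    # Same pickle.dump call as in DataContainer.save; gzip only for *.gz names.
    opener = gzip.open if filename.endswith('.gz') else open
    with opener(filename, 'wb') as f:
        pickle.dump(trials, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_trials(filename):
    opener = gzip.open if filename.endswith('.gz') else open
    with opener(filename, 'rb') as f:
        return pickle.load(f)


# Round trip: write three trials to a temporary gzipped pickle and read them back.
trials = [Trial(1, i, [0.1 * i, 0.2 * i]) for i in range(3)]
path = os.path.join(tempfile.mkdtemp(), 'trials.pkl.gz')
save_trials(trials, path)
restored = load_trials(path)
```

One thing to keep in mind: pickle stores a reference to the class, not its code, so the Trial class has to be importable under the same module name when the file is loaded.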

    If you really have a lot of data and don't use all of it in every run of your test program, there are other options, such as a database. Assuming you need most of the data in each run, though, the above is about the best simple option.
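
To illustrate the database option, here is a minimal sketch that assumes SQLite (through Python's built-in sqlite3 module) is an acceptable database and that each trial can be pickled on its own. The table layout and the helper names open_store, put_trial, and get_trial are invented for this example. Each trial is stored as one pickled blob keyed by (subID, triID), so a run can load only the trials it needs. Dicts are used as payloads to keep the snippet short; any picklable object, such as a Trial instance, could be stored instead.

```python
import pickle
import sqlite3


def open_store(path):
    # One row per trial; the pickled object lives in the payload column.
    conn = sqlite3.connect(path)
    conn.execute(
        'CREATE TABLE IF NOT EXISTS trials ('
        'subID INTEGER, triID INTEGER, payload BLOB, '
        'PRIMARY KEY (subID, triID))'
    )
    return conn


def put_trial(conn, subID, triID, trial):
    blob = pickle.dumps(trial, protocol=pickle.HIGHEST_PROTOCOL)
    with conn:  # commits on success, rolls back on error
        conn.execute(
            'INSERT OR REPLACE INTO trials VALUES (?, ?, ?)',
            (subID, triID, sqlite3.Binary(blob)),
        )


def get_trial(conn, subID, triID):
    row = conn.execute(
        'SELECT payload FROM trials WHERE subID = ? AND triID = ?',
        (subID, triID),
    ).fetchone()
    return None if row is None else pickle.loads(bytes(row[0]))


# Store three trials for subject 7, then load just one of them.
conn = open_store(':memory:')  # use a file path instead for persistent storage
for tri in range(3):
    put_trial(conn, 7, tri, {'subID': 7, 'triID': tri, 'data_A': [tri, tri + 1]})
one = get_trial(conn, 7, 2)
```

Whether this beats a single pickle file depends on how much of the data each run touches; if most of it is needed every time, the single file is simpler.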