Tags: python, image-processing, hdf5, h5py, hdf5storage

How to compare multiple hdf5 files


I have multiple HDF5 files (pixel-level annotations) for one image. The image masks are stored in the HDF5 files as key-value pairs, with the key being the ID of some class. The masks all match the dimensions of their corresponding image and represent labels for the pixels in that image. I need to compare all the h5 files with one another and find the final mask that represents the majority, but I don't know how to compare multiple h5 files in Python. Can someone kindly help?


Solution

  • What do you mean by "compare"?

    If you just want to compare the files to see if they are the same, you can use the h5diff utility from The HDF5 Group. It comes with the HDF5 installer. You can get more info about h5diff here: h5diff utility. Links to all HDF5 utilities are at the top of the page: HDF5 Tools

    It sounds like you need to do more than that. Please clarify what you mean by "find out the final mask that represents the majority". Do you want to find the average image size (mean, median, or mode)? If so, it is relatively straightforward (if you know Python) to open each file and get the dimensions of the image data (the shape of each dataset, which is what you call the values). For reference, the key/value terminology is how h5py refers to HDF5 dataset names and datasets.

    Here is a basic outline of the process to open one HDF5 file and loop through the datasets (by key name) to get each dataset's shape (image size). For multiple files, you can add an outer for loop that uses the glob.iglob() iterator to get the HDF5 file names. For simplicity, I saved the shape values to three lists and calculated the mean manually (sum()/len()). If you want a different statistic, I suggest using NumPy arrays: NumPy has mean and median functions built in. For the mode, you need the scipy.stats module (it works on NumPy arrays).

    Method 1: iterates on .keys()

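    import h5py

    # filename is a placeholder: set it to the path of one HDF5 file
    s0_list = []
    s1_list = []
    s2_list = []
    with h5py.File(filename, 'r') as h5f:
        for name in h5f.keys():
            # Look up each dataset by name; assumes every dataset has 3 dimensions
            shape = h5f[name].shape
            s0_list.append(shape[0])
            s1_list.append(shape[1])
            s2_list.append(shape[2])

    # Mean length along each axis
    print('Ave len axis=0:', sum(s0_list) / len(s0_list))
    print('Ave len axis=1:', sum(s1_list) / len(s1_list))
    print('Ave len axis=2:', sum(s2_list) / len(s2_list))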
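
    For multiple files, the same loop can be wrapped in glob.iglob() and the statistics handed to NumPy. This is only a minimal sketch: the function names collect_shapes and shape_stats are made up for illustration, and it assumes every dataset in every file has the same number of dimensions.

```python
import glob

import h5py
import numpy as np


def collect_shapes(pattern):
    """Return an (n_datasets, ndim) array of dataset shapes from all files matching pattern."""
    shapes = []
    for path in glob.iglob(pattern):
        with h5py.File(path, "r") as h5f:
            for name, obj in h5f.items():
                # Skip groups; only datasets carry image data
                if isinstance(obj, h5py.Dataset):
                    shapes.append(obj.shape)
    # Assumption: all shapes have the same length, so this is a regular 2-D array
    return np.array(shapes)


def shape_stats(shapes):
    """Mean and median length along each axis, computed over all datasets."""
    return {
        "mean": shapes.mean(axis=0),
        "median": np.median(shapes, axis=0),
    }
```

    A call such as shape_stats(collect_shapes("masks/*.h5")) would then give the per-axis mean and median, where "masks/*.h5" is a placeholder pattern.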
    

    Method 2: iterates on .items()

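    import h5py

    # filename is a placeholder: set it to the path of one HDF5 file
    s0_list = []
    s1_list = []
    s2_list = []
    with h5py.File(filename, 'r') as h5f:
        for name, ds in h5f.items():
            # items() yields the dataset object directly; assumes 3 dimensions
            shape = ds.shape
            s0_list.append(shape[0])
            s1_list.append(shape[1])
            s2_list.append(shape[2])

    # Mean length along each axis
    print('Ave len axis=0:', sum(s0_list) / len(s0_list))
    print('Ave len axis=1:', sum(s1_list) / len(s1_list))
    print('Ave len axis=2:', sum(s2_list) / len(s2_list))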
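
    If "majority" instead means a per-pixel vote across the mask files, one possible reading can be sketched as follows. This is a guess at the intent, not a confirmed answer: it assumes each dataset is a binary mask (nonzero means the pixel belongs to the class named by the key), that masks with the same key have the same shape in every file, and that a file lacking a key counts as a "no" vote. The function name majority_masks is made up for illustration.

```python
import h5py
import numpy as np


def majority_masks(paths):
    """Per-class, per-pixel majority vote over binary masks stored in several HDF5 files."""
    votes = {}  # class key -> per-pixel count of files that mark the pixel
    n_files = 0
    for path in paths:
        n_files += 1
        with h5py.File(path, "r") as h5f:
            for key, ds in h5f.items():
                if not isinstance(ds, h5py.Dataset):
                    continue  # ignore groups
                mask = ds[()] > 0  # read the whole dataset and binarize it
                if key in votes:
                    votes[key] += mask
                else:
                    votes[key] = mask.astype(np.int64)
    # Keep a pixel when strictly more than half of the files mark it
    return {key: count * 2 > n_files for key, count in votes.items()}
```

    The paths argument can be any iterable of file names, for example glob.iglob("masks/*.h5") with a placeholder pattern. The result maps each class key to a boolean array of the same shape as the input masks.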