
Convert a folder comprising jpeg images to hdf5


Is there a way to convert a folder comprising .jpeg images to hdf5 in Python? I am trying to build a neural network model for classification of images. Thanks!


Solution

  • There are many ways to process and save image data. Here are two variations of a method that reads all of the image files in one folder and loads them into an HDF5 file. Outline of the process:

    1. Count the number of images (used to size the dataset).
    2. Create HDF5 file (prefixed: 1ds_)
    3. Create empty dataset with appropriate shape and type (integers)
    4. Use glob.iglob() to loop over images. Then do:
      • Read with cv2.imread()
      • Resize with cv2.resize()
      • Copy the resized image into the dataset at index cnt (the first axis of img_ds)

    This is ONE way to do it. Additional things to consider:

    1. I loaded all images into one dataset. If your images have different sizes, you must resize them so they fit the dataset shape. If you don't want to resize, save each image in its own dataset (same process, but create a new dataset inside the loop). See the second with/as: block and loop, which saves the data to the second HDF5 file (prefixed: nds_).
    2. I didn't try to capture image names. You could do that with attributes on the single dataset, or by using the image name as the dataset name when there are multiple datasets.
    3. My images are .ppm files; for JPEG images, change the glob patterns to *.jpg (or *.jpeg).
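
    For item 2, one possible way to keep the file names is a parallel string dataset stored next to the image dataset. This is an alternative to the attribute approach mentioned above, since HDF5 attributes are meant for small metadata. A minimal sketch, assuming synthetic NumPy arrays in place of real images; the file name names_demo.h5, the dataset name 'filenames', and the sample file names are all hypothetical:

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file names and blank 30x30 colour images standing in for real files
names = ['cat_001.jpg', 'dog_001.jpg', 'dog_002.jpg']
images = np.zeros((len(names), 30, 30, 3), dtype='uint8')

h5path = os.path.join(tempfile.mkdtemp(), 'names_demo.h5')

with h5py.File(h5path, 'w') as h5f:
    h5f.create_dataset('images', data=images)
    # variable-length strings, one entry per image, in the same order as 'images'
    h5f.create_dataset('filenames',
                       data=np.array(names, dtype=object),
                       dtype=h5py.string_dtype())

with h5py.File(h5path, 'r') as h5f:
    # depending on the h5py version, strings come back as bytes or str
    stored_names = [n.decode('utf-8') if isinstance(n, bytes) else n
                    for n in h5f['filenames'][:]]

print(stored_names)
```

    Row i of 'filenames' then describes row i of 'images', so the two datasets must always be written in the same order.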

    Simpler Version Below (added Mar 16 2021):
    Assumes all files are in the current folder, and loads all resized images into one dataset (named 'images'). See the previous code for the second method, which loads each image into a separate dataset without resizing.

    import glob
    import h5py
    import cv2
    
    IMG_WIDTH = 30
    IMG_HEIGHT = 30
    
    h5file = 'import_images.h5'
    
    # build the file list once so the count and the loop always agree
    img_files = sorted(glob.glob('./*.ppm'))  # use './*.jpg' for JPEG images
    nfiles = len(img_files)
    print(f'count of image files nfiles={nfiles}')
    
    # resize all images and load into a single dataset
    with h5py.File(h5file, 'w') as h5f:
        # cv2 arrays are shaped (height, width, channels) and hold 8-bit values
        img_ds = h5f.create_dataset('images', shape=(nfiles, IMG_HEIGHT, IMG_WIDTH, 3), dtype='uint8')
        for cnt, ifile in enumerate(img_files):
            img = cv2.imread(ifile, cv2.IMREAD_COLOR)
            # or use cv2.IMREAD_GRAYSCALE, cv2.IMREAD_UNCHANGED
            # (these can change the channel count, so adjust the dataset shape to match)
            img_resize = cv2.resize(img, (IMG_WIDTH, IMG_HEIGHT))  # dsize is (width, height)
            img_ds[cnt, :, :, :] = img_resize
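
    To check the result, or to feed the images to a model, the dataset can be read back into a NumPy array. A minimal sketch, assuming a small synthetic file is written first so the example runs on its own; the temporary path and the random pixel data are made up for illustration:

```python
import os
import tempfile

import h5py
import numpy as np

h5path = os.path.join(tempfile.mkdtemp(), 'import_images.h5')

# Stand-in for the file written by the code above: 5 random 30x30 colour images
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(5, 30, 30, 3), dtype=np.uint8)
with h5py.File(h5path, 'w') as h5f:
    h5f.create_dataset('images', data=fake_images)

with h5py.File(h5path, 'r') as h5f:
    img_ds = h5f['images']
    print(img_ds.shape, img_ds.dtype)
    first = img_ds[0]      # reads a single image from disk
    all_imgs = img_ds[:]   # reads the whole dataset into memory

# Scale to floats in [0, 1], a common preprocessing step before training
x = all_imgs.astype('float32') / 255.0
```

    Slicing the dataset (img_ds[0], img_ds[10:20]) reads only the requested images, which helps when the full set does not fit in memory.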
    

    Previous Code Below (from Mar 15 2021):

    import sys
    import glob
    import h5py
    import cv2
    
    IMG_WIDTH = 30
    IMG_HEIGHT = 30
    
    # Check command-line arguments
    if len(sys.argv) != 3:
        sys.exit("Usage: python load_images_to_hdf5.py data_directory output.h5")
    
    data_dir = sys.argv[1]
    print('data_dir =', data_dir)
    h5file = sys.argv[2]
    print('Save images to:', h5file)
    
    # build the file list once so the count and both loops agree
    img_files = sorted(glob.glob(data_dir + '/*.ppm'))  # use '/*.jpg' for JPEG images
    nfiles = len(img_files)
    print(f'Reading dir: {data_dir}; nfiles={nfiles}')
    
    # resize all images and load into a single dataset
    with h5py.File('1ds_' + h5file, 'w') as h5f:
        # cv2 arrays are shaped (height, width, channels) and hold 8-bit values
        img_ds = h5f.create_dataset('images', shape=(nfiles, IMG_HEIGHT, IMG_WIDTH, 3), dtype='uint8')
        for cnt, ifile in enumerate(img_files):
            img = cv2.imread(ifile, cv2.IMREAD_COLOR)
            # or use cv2.IMREAD_GRAYSCALE, cv2.IMREAD_UNCHANGED
            img_resize = cv2.resize(img, (IMG_WIDTH, IMG_HEIGHT))  # dsize is (width, height)
            img_ds[cnt, :, :, :] = img_resize
    
    # load each image into a separate dataset (image NOT resized)
    with h5py.File('nds_' + h5file, 'w') as h5f:
        for cnt, ifile in enumerate(img_files):
            img = cv2.imread(ifile, cv2.IMREAD_COLOR)
            # or use cv2.IMREAD_GRAYSCALE, cv2.IMREAD_UNCHANGED
            h5f.create_dataset(f'images_{cnt+1:03}', data=img)
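
    Reading the second file back means looping over its datasets, since each image is stored separately and may have its own shape. A sketch that follows the same naming scheme (images_001, images_002, ...), using blank synthetic arrays of different sizes in place of real images; the file name nds_demo.h5 is hypothetical:

```python
import os
import tempfile

import h5py
import numpy as np

h5path = os.path.join(tempfile.mkdtemp(), 'nds_demo.h5')

# Synthetic images with different (height, width) to mimic files that were not resized
shapes = [(20, 30, 3), (48, 64, 3), (10, 10, 3)]
with h5py.File(h5path, 'w') as h5f:
    for cnt, shape in enumerate(shapes):
        h5f.create_dataset(f'images_{cnt+1:03}', data=np.zeros(shape, dtype='uint8'))

loaded = {}
with h5py.File(h5path, 'r') as h5f:
    # sort the names explicitly; the zero-padded suffix keeps them in sequence
    for name in sorted(h5f.keys()):
        loaded[name] = h5f[name][:]
        print(name, loaded[name].shape)
```

    Because the shapes differ, the images end up in a dict of arrays rather than one stacked array; they would still need resizing or padding before being batched for a network.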