Search code examples
computer-visiondatasetimage-recognitiontext-recognitiontext-database

problem in reading the images of mjsynth dataset


recently I am trying to train a text recognition network. I tried to start the training by feeding the mjsynth dataset to network. However, there seems to be some images in the dataset which are blank. So, while training, if I directly feed the data to network, it generates the error while reading the image, and because of this error, training stops. Does anyone know the list of the blank images in mjsynth dataset. So that I can remove those blank images from the dataset.


Solution

  • After trying many things, I ended up running a pretty long experiment to read almost 9 million images of the mjsynth dataset and collected images which are currupted or are blank. I found that theren are 12 currupted images which stops the model training when the mjsynth data is directly fed to the model for training without any varification. Here is the code and founded invalid images. So you can remove this images from the mjsynth dataset before starting the model training.

    import os
    import cv2
    import numpy as np
    rootdir = './mjsynth/mnt/ramdisk/max/90kDICT32px'
    
    invalid_images = []
    for subdir, dirs, files in os.walk(rootdir):
        for file in files:
            im_path = os.path.join(subdir, file)
            im = cv2.imread(im_path)
            if type(im) != np.ndarray:
                invalid_images.append(im_path)
    
    print('invalid_images = {}'.format(invalid_images ))
    
    # output
    invalid_images = 
    ['./mjsynth/mnt/ramdisk/max/90kDICT32px\\1863/4/223_Diligently_21672.jpg',
    './mjsynth/mnt/ramdisk/max/90kDICT32px\\913/4/231_randoms_62372.jpg',
    './mjsynth/mnt/ramdisk/max/90kDICT32px\\2025/2/364_SNORTERS_72304.jpg',
    './mjsynth/mnt/ramdisk/max/90kDICT32px\\495/6/81_MIDYEAR_48332.jpg',
    './mjsynth/mnt/ramdisk/max/90kDICT32px\\869/4/234_TRIASSIC_80582.jpg',
    './mjsynth/mnt/ramdisk/max/90kDICT32px\\173/2/358_BURROWING_10395.jpg',
    './mjsynth/mnt/ramdisk/max/90kDICT32px\\2013/2/370_refract_63890.jpg',
    './mjsynth/mnt/ramdisk/max/90kDICT32px\\368/4/232_friar_30876.jpg',
    './mjsynth/mnt/ramdisk/max/90kDICT32px\\1881/4/225_Marbling_46673.jpg',
    './mjsynth/mnt/ramdisk/max/90kDICT32px\\1817/2/363_actuating_904.jpg',
    './mjsynth/mnt/ramdisk/max/90kDICT32px\\275/6/96_hackle_34465.jpg',
    './mjsynth/mnt/ramdisk/max/90kDICT32px\\2069/4/192_whittier_86389.jpg']