
Make a TensorFlow dataset from a huge number of images (*.jpg) and labels (*.mat)


I have a huge number of images with their labels in .mat files (too many to use tf.data.Dataset.from_tensor_slices()), and I want to use the tf.data API to build a TensorFlow dataset out of them.

As I read in the documentation, I can use tf.data.TextLineDataset for a large amount of data (I write the paths of all the images to a txt file and pass the path of that txt file as the tf.data.TextLineDataset argument). Then I can use the map method to read each file (tf.read_file), decode the jpg image (tf.image.decode_jpeg), and do some basic transformations on the image.

However, I cannot use scipy.io.loadmat anywhere in the map function, because I have no Python string indicating the path to the mat file. Inside map, all I have is a tf.Tensor.

I don't think that reading all the images and writing them into a TFRecord is very efficient in this case, because then I am basically doing everything twice: once reading all the images to build the TFRecord, and once more reading the TFRecord to make the TensorFlow dataset.

Any idea how I can resolve this issue?

This is my code:

dataset = tf.data.TextLineDataset(txt_file).map(read_img_and_mat)

and then:

def read_img_and_mat(path):
    # `path` is a scalar tf.string tensor here, not a Python string
    image_string = tf.read_file(path)
    image_decoded = tf.image.decode_jpeg(image_string, channels=3)
    label = ...  # get label from mat file -- this is where loadmat cannot be called
    return image_decoded, label

Solution

  • I found a way to do it using tf.data.Dataset.from_generator. The trick I found was to make two separate Datasets (one for the mat files and one for the jpg files) and then to combine them using tf.data.Dataset.zip.

    Here is how it works:

    def read_mat_file():
        while True:  # loop forever so the dataset never runs out of elements
            with open('mat_addresses.txt', 'r') as input_:
                for line in input_:
                    # open the mat file at this path and extract the label from it as an np.array
                    yield tuple(label)  # why tuple? https://github.com/tensorflow/tensorflow/issues/13101

    mat_dataset = tf.data.Dataset.from_generator(read_mat_file, tf.int64)
    

    In order to get the next batch, one just has to do:

    iter = mat_dataset.make_one_shot_iterator()
    sess.run(iter.get_next())
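
from_generator simply pulls one element from an ordinary Python generator per get_next() call, and because of the while True loop the generator wraps around instead of terminating. A plain-Python analogy, with made-up label values and no TensorFlow needed:

```python
from itertools import islice

def read_labels():
    # stand-in for read_mat_file: loop forever over three fake labels,
    # the way the generator above loops forever over mat_addresses.txt
    while True:
        for label in [(0,), (1,), (2,)]:
            yield label

# each sess.run(iter.get_next()) corresponds to one next() on the generator
batch = list(islice(read_labels(), 4))
print(batch)  # [(0,), (1,), (2,), (0,)] -- wraps around after the last label
```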
    

    However, one can make img_dataset the same way and combine it with mat_dataset like this:

    def read_img(path):
        image_string = tf.read_file(path)
        image_decoded = tf.image.decode_jpeg(image_string, channels=3)
        return image_decoded

    img_dataset = tf.data.TextLineDataset('img_addresses.txt').map(read_img)

    dataset = tf.data.Dataset.zip((mat_dataset, img_dataset))
    

    and now the next batch can be fetched as shown before.
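
Note that tf.data.Dataset.zip pairs the two datasets strictly by position, like Python's built-in zip, so mat_addresses.txt and img_addresses.txt must list corresponding files in the same order. A plain-Python sketch of the pairing (and of the silent failure mode; the file names are made up):

```python
labels = [(0,), (1,), (2,)]           # what mat_dataset would yield
images = ['a.jpg', 'b.jpg', 'c.jpg']  # stand-ins for the decoded images

pairs = list(zip(labels, images))     # correct pairing: same file order

# if img_addresses.txt were shuffled independently of mat_addresses.txt,
# zip would silently pair wrong label/image combinations:
bad = list(zip(labels, ['b.jpg', 'a.jpg', 'c.jpg']))
print(pairs[0], bad[0])  # ((0,), 'a.jpg') ((0,), 'b.jpg')
```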

    PS. I have no idea how efficient this code is compared to feed_dict.