python, tensorflow, keras, tensorflow-datasets, kaggle

Convert folder of images with labels in CSV file into a tensorflow Dataset


This clothing dataset (from Kaggle), once downloaded, is laid out as follows:

  • Labels inside a .csv file
  • Images in a subdirectory
+-dataset/
  |
  +-images.csv
  |
  +-images/
    |
    +-d7ed1d64-2c65-427f-9ae4-eb4aaa3e2389.jpg
    |
    +-5c1b7a77-1fa3-4af8-9722-cd38e45d89da.jpg
    |
    +-... <additional files>

I would like to load this into a TensorFlow Dataset (version: tensorflow~=2.4).

Is there some way I can convert this directory of images, with labels in a separate .csv, into a tf.data.Dataset?

The question "Tensorflow load image dataset with image labels" suggests ImageDataGenerator.flow_from_dataframe, but this is now deprecated :/


Solution

  • Based on the answers:

    I put together the following myself. There is probably a simpler way, but this at least works. I was hoping for more built-in support, though:

    import os.path
    from typing import Dict, Tuple
    
    import pandas as pd
    import tensorflow as tf
    
    # Assumption: the dataset was extracted to ./dataset; adjust as needed.
    DATA_ABS_PATH = os.path.abspath("dataset")
    
    
    def get_full_dataset(
        batch_size: int = 32, image_size: Tuple[int, int] = (256, 256)
    ) -> tf.data.Dataset:
        data = pd.read_csv(os.path.join(DATA_ABS_PATH, "images.csv"))
        images_path = os.path.join(DATA_ABS_PATH, "images")
        # The CSV stores image ids without an extension; build full file paths.
        data["image"] = data["image"].map(lambda x: os.path.join(images_path, f"{x}.jpg"))
        filenames: tf.Tensor = tf.constant(data["image"].tolist(), dtype=tf.string)
        data["label"] = data["label"].str.lower()
        # Sort the class names so the label ids are the same on every run.
        class_name_to_label: Dict[str, int] = {
            label: i for i, label in enumerate(sorted(set(data["label"])))
        }
        # int32 avoids the 256-class ceiling of uint8.
        labels: tf.Tensor = tf.constant(
            data["label"].map(class_name_to_label).tolist(), dtype=tf.int32
        )
        dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
    
        def _parse_function(filename, label):
            # channels=3 forces RGB so grayscale JPEGs batch with the rest.
            jpg_image: tf.Tensor = tf.io.decode_jpeg(tf.io.read_file(filename), channels=3)
            return tf.image.resize(jpg_image, size=image_size), label
    
        dataset = dataset.map(_parse_function)
        return dataset.batch(batch_size)
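
The label-encoding step in the function above can be checked in isolation, without TensorFlow or the image files. The sketch below is only an illustration and not part of the original answer: encode_labels is a hypothetical helper name and the sample labels are made up. It sorts the class names before numbering them, because iterating over a bare set of strings can yield a different order in each Python process, which would change the class ids between runs.

```python
from typing import Dict, Iterable, List, Tuple


def encode_labels(raw_labels: Iterable[str]) -> Tuple[List[int], Dict[str, int]]:
    """Map class names to integer ids in a reproducible order.

    Hypothetical helper for illustration; not part of TensorFlow or pandas.
    """
    # Normalise case so "T-Shirt" and "t-shirt" end up in the same class.
    normalised = [label.lower() for label in raw_labels]
    # sorted() pins the order; a bare set of strings has no stable order
    # across Python processes because of hash randomisation.
    class_to_id = {name: i for i, name in enumerate(sorted(set(normalised)))}
    return [class_to_id[name] for name in normalised], class_to_id


# Made-up sample labels, standing in for the "label" column of images.csv.
ids, class_to_id = encode_labels(["T-Shirt", "Dress", "t-shirt", "Shoes"])
print(ids)          # [2, 0, 2, 1]
print(class_to_id)  # {'dress': 0, 'shoes': 1, 't-shirt': 2}
```

The returned ids can be passed to tf.constant in place of the mapped label column, and keeping class_to_id around makes it possible to translate predicted ids back to class names later.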