python, tensorflow, math, tensorflow-datasets

What does cardinality mean in relation to an image dataset?


After successfully creating a TensorFlow image Dataset with:

dataset = tf.keras.utils.image_dataset_from_directory(...)

which prints

Found 21397 files belonging to 5 classes.
Using 17118 files for training.

There is the cardinality method:

dataset.cardinality()

which returns a tensor containing the single value

tf.Tensor(535, shape=(), dtype=int64)

I've read the docs, but I don't understand what 535 represents or why it's different from the number of files.

I ask, because I would like to understand how cardinality plays into this equation:

steps_per_epoch = dataset.cardinality().numpy() // batch_size


Solution

  • The cardinality, in your case, is simply the number of batches in the dataset, i.e. the number of files divided by the batch size, rounded up:

    import tensorflow as tf
    import pathlib
    
    dataset_url = "https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz"
    data_dir = tf.keras.utils.get_file('flower_photos', origin=dataset_url, untar=True)
    data_dir = pathlib.Path(data_dir)
    
    batch_size = 32
    
    train_ds = tf.keras.utils.image_dataset_from_directory(
      data_dir,
      validation_split=0.2,
      subset="training",
      seed=123,
      image_size=(180, 180),
      batch_size=batch_size)
    
    print(train_ds.cardinality())

    Output:

    Found 3670 files belonging to 5 classes.
    Using 2936 files for training.
    tf.Tensor(92, shape=(), dtype=int64)
    

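    The equation is: ceil(2936 / 32) = ceil(91.75) = 92 = cardinality, so it depends on your batch size. In your case, with the default batch_size of 32, ceil(17118 / 32) = 535. Because the cardinality already counts batches rather than files, it can be used as steps_per_epoch directly, without dividing by batch_size again.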
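
The arithmetic can be checked without TensorFlow. The snippet below is a minimal sketch in plain Python; `expected_batches` is a hypothetical helper written for illustration (not part of TensorFlow), and it assumes the question's dataset was built with the default `batch_size=32`. The `drop_remainder` flag mirrors the option of the same name on `Dataset.batch`, where the last partial batch is discarded instead of kept.

```python
import math


def expected_batches(num_examples: int, batch_size: int, drop_remainder: bool = False) -> int:
    """Hypothetical helper: number of batches when num_examples items are batched."""
    if drop_remainder:
        # The last partial batch is discarded, so round down.
        return num_examples // batch_size
    # The last partial batch is kept, so round up.
    return math.ceil(num_examples / batch_size)


# Flower dataset from the answer: 2936 training files, batch_size=32.
print(expected_batches(2936, 32))   # 92

# Dataset from the question: 17118 training files, assuming the default batch_size=32.
print(expected_batches(17118, 32))  # 535

# The batch count is already the number of steps in one pass over the data,
# so it is used as steps_per_epoch as-is (no further division by batch_size).
steps_per_epoch = expected_batches(17118, 32)
```

Both results match the values reported by `cardinality()` above, which suggests the 535 in the question is a batch count rather than a file count.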