Tags: python, numpy, tensorflow, tensorflow-datasets

Consistent extraction of data from tensorflow dataset


I want to extract the data from a tensorflow dataset consistently into numpy arrays/tensors. I'm loading the pictures with

import numpy as np
from tensorflow import keras

# img_height, img_width and sz_batch are defined elsewhere
data = keras.preprocessing.image_dataset_from_directory(
  './data',
  labels='inferred',
  label_mode='binary',
  validation_split=0.2,
  subset="training",
  image_size=(img_height, img_width),
  batch_size=sz_batch,
  crop_to_aspect_ratio=True
)

I already got the hint to use the following lines:

xdata = np.concatenate([x for x, y in data], axis=0)
ydata = np.concatenate([y for x, y in data], axis=0)

The problem, however, is that the extracted data in xdata and ydata is not consistent: the labels in ydata don't match the samples in xdata (I checked this by simply looping through the extracted data).

My second idea was to extract the data in a standard for loop:

# sz1, sz2: image height and width
xdata = np.empty([sz1, sz2, 3])[np.newaxis, ...]
ydata = np.array([0])
for images, labels in data:
    xdata = np.concatenate((xdata, images), axis=0)
    ydata = np.concatenate((ydata, labels), axis=0)

# drop the dummy first entries again
xdata = xdata[1:]
ydata = ydata[1:]

Even though the data seem to be consistent with this approach, I think this way is quite cumbersome and it's also not nicely (and supposedly also not efficiently) written - especially the last two lines bother me. But I wasn't able to come up with an easier way to extract the data and stack it together in numpy arrays/tensors.

I'd be grateful for help on how to solve this problem properly in python.

Anyway, I'm wondering why handling the tensorflow dataset is, at least in my opinion, really cumbersome. Firstly, I need to solve the problem stated above to use the data in routines other than tensorflow. Secondly, even if I use the data anywhere else than for training in tensorflow, it's not really straightforward in my opinion. E.g. if I want to compare predicted labels from a NN with the true labels from a dataset, I can't easily extract the consistent labels of this dataset - I have to predict inside a for loop, as sketched below.
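
For illustration, that comparison currently looks roughly like the following sketch (model stands for an already trained Keras model with a sigmoid output; the predictions have to be made batch by batch inside the loop so that they stay paired with the labels):

import numpy as np

correct = 0
total = 0
for images, labels in data:
    # predict per batch so predictions and labels come from the same iteration
    preds = model.predict(images, verbose=0)
    pred_labels = (preds > 0.5).astype('float32')
    correct += int(np.sum(pred_labels == labels.numpy()))
    total += int(labels.shape[0])
print('accuracy:', correct / total)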

Note: I won't/can't use tfds


Solution

  • Regarding the order of your dataset when converting it to numpy arrays: make sure you set shuffle=False in image_dataset_from_directory if you want a deterministic order. By default the data is shuffled, and it is shuffled again on every pass over the dataset, so iterating over it twice (for example once for the images and once for the labels, as in the two list comprehensions above) yields two different orderings:

    import tensorflow as tf
    import matplotlib.pyplot as plt
    import pathlib
    
    dataset_url = "https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz"
    data_dir = tf.keras.utils.get_file('flower_photos', origin=dataset_url, untar=True)
    data_dir = pathlib.Path(data_dir)
    
    batch_size = 32
    
    train_ds = tf.keras.utils.image_dataset_from_directory(
      data_dir,
      validation_split=0.2,
      subset="training",
      seed=123,
      image_size=(180, 180),
      batch_size=batch_size,
      shuffle=False)
    
    normalization_layer = tf.keras.layers.Rescaling(1./255)
    train_ds = train_ds.map(lambda x, y: (normalization_layer(x), y))
    images, labels = next(iter(train_ds.take(1)))
    image = images[0]
    plt.title('label :: ' + str(labels[0]))
    plt.imshow(image.numpy())
    


    Afterwards, you can try a couple of methods to convert your dataset into lists or array-like structures:

    Option 1:

    train_ds = train_ds.unbatch()
    # iterate once over the (image, label) pairs, then transpose into two lists
    data = list(train_ds)
    data = list(map(list, zip(*data)))
    images, labels = data[0], data[1]
    
    image = images[0]
    plt.title('label :: ' + str(labels[0]))
    plt.imshow(image.numpy())
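
    If you need numpy arrays rather than Python lists of tensors, the lists from Option 1 can be stacked afterwards (a small sketch):

    import numpy as np

    # stack the lists of eager tensors into (N, H, W, C) and (N,) arrays
    images = np.stack([img.numpy() for img in images])
    labels = np.stack([lbl.numpy() for lbl in labels])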
    

    Option 2:

    import numpy as np
    
    train_ds = train_ds.unbatch()
    # two separate passes over the (unshuffled) dataset: one for images, one for labels
    images = np.asarray(list(train_ds.map(lambda x, y: x)))
    labels = np.asarray(list(train_ds.map(lambda x, y: y)))
    image = images[0]
    plt.title('label :: ' + str(labels[0]))
    plt.imshow(image)
    

    Option 3:

    import numpy as np
    
    # no unbatching
    images = np.concatenate(list(train_ds.map(lambda x, y: x)))
    labels = np.concatenate(list(train_ds.map(lambda x, y: y)))
    
    image = images[0]
    plt.title('label :: ' + str(labels[0]))
    plt.imshow(image)
    

    All three options maintain the order of your data, because the dataset itself is not shuffled. Note that Options 2 and 3 iterate over the dataset twice (once for the images and once for the labels), which is only safe with shuffle=False; with a shuffled dataset they would run into the same mismatch as the original list comprehensions.

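    As a quick sanity check (a sketch, assuming the batched train_ds and the arrays from Option 3), a fresh manual pass over the dataset should yield exactly the same labels, since nothing is shuffled:

    # with shuffle=False, a second pass over the dataset gives the same label order
    manual_labels = np.concatenate([y.numpy() for _, y in train_ds])
    assert np.array_equal(labels, manual_labels)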

    Update 1: If you want to keep shuffle=True, you could instead collect the images and labels with tf.TensorArray in a single pass over the dataset; since each (x, y) pair is taken from the same iteration, the two arrays stay consistent:

    # this works even when the dataset was created with shuffle=True, because the
    # images and labels are collected together in one single pass over the data
    images = tf.TensorArray(dtype=tf.float32, size=0, dynamic_size=True)
    labels = tf.TensorArray(dtype=tf.int32, size=0, dynamic_size=True)  # use tf.float32 for label_mode='binary'

    for x, y in train_ds.unbatch():
      images = images.write(images.size(), x)
      labels = labels.write(labels.size(), y)

    images = images.stack()
    labels = labels.stack()
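
    With the images and labels extracted in a consistent order, comparing a network's predictions with the true labels no longer needs a per-sample loop. A minimal sketch, assuming the numpy arrays from Option 3, the binary-label setup from the question, and an already trained Keras model called model with a sigmoid output:

    import numpy as np

    # predict on the whole extracted image array at once
    preds = model.predict(images)
    # threshold the sigmoid outputs to get hard 0/1 predictions
    pred_labels = (preds.squeeze() > 0.5).astype(int)
    # the labels are aligned with the images, so a direct comparison is valid
    accuracy = np.mean(pred_labels == np.asarray(labels).squeeze().astype(int))
    print('accuracy:', accuracy)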