Tags: python, numpy, tensorflow, tensorflow-datasets

Consistent extraction of data from tensorflow dataset


I want to extract the data from a tensorflow dataset consistently into numpy arrays/tensors. I'm loading the pictures with

import numpy as np
from tensorflow import keras

# img_height, img_width and sz_batch are defined elsewhere
data = keras.preprocessing.image_dataset_from_directory(
  './data',
  labels='inferred',
  label_mode='binary',
  validation_split=0.2,
  subset="training",
  image_size=(img_height, img_width),
  batch_size=sz_batch,
  crop_to_aspect_ratio=True
)

I already got the hint to use the following lines:

xdata = np.concatenate([x for x, y in data], axis=0)
ydata = np.concatenate([y for x, y in data], axis=0)

The problem, however, is that the extracted data in xdata and ydata is not consistent: the labels in ydata don't match the samples in xdata (I checked this by simply looping through the extracted data).

My second idea was to extract the data in a standard for loop:

# sz1, sz2: image height and width
xdata = np.empty([sz1, sz2, 3])[np.newaxis, ...]
ydata = np.array([0])
for images, labels in data:
    xdata = np.concatenate((xdata, images), axis=0)
    ydata = np.concatenate((ydata, labels), axis=0)

# drop the dummy first entries again
xdata = xdata[1:]
ydata = ydata[1:]

Even though the data seem to be consistent with this approach, I think this way is quite cumbersome and it's also not nicely (and supposedly also not efficiently) written - especially the last two lines bother me. But I wasn't able to come up with an easier way to extract the data and stack it together in numpy arrays/tensors.

I'd be grateful for help on how to solve this problem properly in python.

Anyway, I'm wondering why handling the tensorflow dataset is, at least in my opinion, really cumbersome. Firstly, I need to solve the problem stated above to use the data in routines other than tensorflow. Secondly, even if I use the data anywhere else than for training in tensorflow, it's not really straightforward in my opinion. E.g. if I want to compare predicted labels from a NN with the true labels from a dataset, I can't easily extract the consistent labels of this dataset - I have to predict inside a for loop, as sketched below.
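
For illustration, that comparison currently looks roughly like the following sketch (model stands for an already trained Keras model with a sigmoid output; the predictions have to be made batch by batch inside the loop so that they stay paired with the labels):

import numpy as np

correct = 0
total = 0
for images, labels in data:
    # predict per batch so predictions and labels come from the same iteration
    preds = model.predict(images, verbose=0)
    pred_labels = (preds > 0.5).astype('float32')
    correct += int(np.sum(pred_labels == labels.numpy()))
    total += int(labels.shape[0])
print('accuracy:', correct / total)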

Note: I won't/can't use tfds


Solution

  • Regarding the order of your dataset when converting it to numpy arrays: make sure you set shuffle=False in image_dataset_from_directory if you want a deterministic order. By default the data is shuffled, and it is shuffled again on every pass over the dataset, so iterating over it twice (for example once for the images and once for the labels, as in the two list comprehensions above) yields two different orderings:

    import tensorflow as tf
    import matplotlib.pyplot as plt
    import pathlib
    
    dataset_url = "https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz"
    data_dir = tf.keras.utils.get_file('flower_photos', origin=dataset_url, untar=True)
    data_dir = pathlib.Path(data_dir)
    
    batch_size = 32
    
    train_ds = tf.keras.utils.image_dataset_from_directory(
      data_dir,
      validation_split=0.2,
      subset="training",
      seed=123,
      image_size=(180, 180),
      batch_size=batch_size,
      shuffle=False)
    
    normalization_layer = tf.keras.layers.Rescaling(1./255)
    train_ds = train_ds.map(lambda x, y: (normalization_layer(x), y))
    images, labels = next(iter(train_ds.take(1)))
    image = images[0]
    plt.title('label :: ' + str(labels[0]))
    plt.imshow(image.numpy())
    


    Afterwards, you can try a couple of methods to convert your dataset into lists or array-like structures:

    Option 1:

    train_ds = train_ds.unbatch()
    # iterate once over the (image, label) pairs, then transpose into two lists
    data = list(train_ds)
    data = list(map(list, zip(*data)))
    images, labels = data[0], data[1]
    
    image = images[0]
    plt.title('label :: ' + str(labels[0]))
    plt.imshow(image.numpy())
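
    If you need numpy arrays rather than Python lists of tensors, the lists from Option 1 can be stacked afterwards (a small sketch):

    import numpy as np

    # stack the lists of eager tensors into (N, H, W, C) and (N,) arrays
    images = np.stack([img.numpy() for img in images])
    labels = np.stack([lbl.numpy() for lbl in labels])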
    

    Option 2:

    import numpy as np
    
    train_ds = train_ds.unbatch()
    # two separate passes over the (unshuffled) dataset: one for images, one for labels
    images = np.asarray(list(train_ds.map(lambda x, y: x)))
    labels = np.asarray(list(train_ds.map(lambda x, y: y)))
    image = images[0]
    plt.title('label :: ' + str(labels[0]))
    plt.imshow(image)
    

    Option 3:

    import numpy as np
    
    # no unbatching
    images = np.concatenate(list(train_ds.map(lambda x, y: x)))
    labels = np.concatenate(list(train_ds.map(lambda x, y: y)))
    
    image = images[0]
    plt.title('label :: ' + str(labels[0]))
    plt.imshow(image)
    

    All three options maintain the order of your data, because the dataset itself is not shuffled. Note that Options 2 and 3 iterate over the dataset twice (once for the images and once for the labels), which is only safe with shuffle=False; with a shuffled dataset they would run into the same mismatch as the original list comprehensions.

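    As a quick sanity check (a sketch, assuming the batched train_ds and the arrays from Option 3), a fresh manual pass over the dataset should yield exactly the same labels, since nothing is shuffled:

    # with shuffle=False, a second pass over the dataset gives the same label order
    manual_labels = np.concatenate([y.numpy() for _, y in train_ds])
    assert np.array_equal(labels, manual_labels)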

    Update 1: If you want to keep shuffle=True, you could instead collect the images and labels with tf.TensorArray in a single pass over the dataset; since each (x, y) pair is taken from the same iteration, the two arrays stay consistent:

    # this works even when the dataset was created with shuffle=True, because the
    # images and labels are collected together in one single pass over the data
    images = tf.TensorArray(dtype=tf.float32, size=0, dynamic_size=True)
    labels = tf.TensorArray(dtype=tf.int32, size=0, dynamic_size=True)  # use tf.float32 for label_mode='binary'

    for x, y in train_ds.unbatch():
      images = images.write(images.size(), x)
      labels = labels.write(labels.size(), y)

    images = images.stack()
    labels = labels.stack()
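
    With the images and labels extracted in a consistent order, comparing a network's predictions with the true labels no longer needs a per-sample loop. A minimal sketch, assuming the numpy arrays from Option 3, the binary-label setup from the question, and an already trained Keras model called model with a sigmoid output:

    import numpy as np

    # predict on the whole extracted image array at once
    preds = model.predict(images)
    # threshold the sigmoid outputs to get hard 0/1 predictions
    pred_labels = (preds.squeeze() > 0.5).astype(int)
    # the labels are aligned with the images, so a direct comparison is valid
    accuracy = np.mean(pred_labels == np.asarray(labels).squeeze().astype(int))
    print('accuracy:', accuracy)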