python tensorflow keras tensorflow-datasets

Extract data from tensorflow dataset (e.g. to numpy)


I'm loading images via

data = keras.preprocessing.image_dataset_from_directory(
  './data', 
  labels='inferred', 
  label_mode='binary', 
  validation_split=0.2, 
  subset="training", 
  image_size=(img_height, img_width), 
  batch_size=sz_batch, 
  crop_to_aspect_ratio=True
)

I also want to use the loaded data in non-TensorFlow routines, so I need to extract it, e.g. into NumPy arrays. How can I do this? I can't use tfds.


Solution

  • I would suggest unbatching your dataset and using tf.data.Dataset.map:

    import pathlib

    import numpy as np
    import tensorflow as tf
    
    dataset_url = "https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz"
    data_dir = tf.keras.utils.get_file('flower_photos', origin=dataset_url, untar=True)
    data_dir = pathlib.Path(data_dir)
    batch_size = 32
    
    train_ds = tf.keras.utils.image_dataset_from_directory(
      data_dir,
      validation_split=0.2,
      subset="training",
      seed=123,
      image_size=(180, 180),
      batch_size=batch_size,
      shuffle=False)
    
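    # Split the batches back into single (image, label) elements
    train_ds = train_ds.unbatch()
    # Two separate passes; shuffle=False keeps both in the same order
    images = np.asarray(list(train_ds.map(lambda x, y: x)))
    labels = np.asarray(list(train_ds.map(lambda x, y: y)))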
    

    Or, as suggested in the comments, you could keep the batches and concatenate them afterwards. Apply this to the still-batched dataset, i.e. skip the train_ds.unbatch() step above:

    images = np.concatenate(list(train_ds.map(lambda x, y: x)))
    labels = np.concatenate(list(train_ds.map(lambda x, y: y)))
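
    The batch-wise variant can also be written as a single pass with as_numpy_iterator(), which yields NumPy arrays directly. The following is only a minimal sketch: the synthetic batched_ds, fake_images and fake_labels are stand-ins invented for illustration (not the flower dataset above), chosen so the snippet runs without downloading anything and so the last batch is smaller than the others.

```python
import numpy as np
import tensorflow as tf

# Hypothetical stand-in for train_ds: 10 small "images" with integer labels,
# batched by 4 so the final batch only holds 2 elements (4 + 4 + 2).
fake_images = tf.random.uniform((10, 4, 4, 3))
fake_labels = tf.range(10, dtype=tf.int32)
batched_ds = tf.data.Dataset.from_tensor_slices((fake_images, fake_labels)).batch(4)

image_batches, label_batches = [], []
# One pass over the dataset; each x and y is already a NumPy array
for x, y in batched_ds.as_numpy_iterator():
    image_batches.append(x)
    label_batches.append(y)

# Concatenating along axis 0 copes with the smaller final batch
images = np.concatenate(image_batches, axis=0)
labels = np.concatenate(label_batches, axis=0)
print(images.shape, labels.shape)
```

    Because images and labels are taken from the same iteration, this form does not rely on two passes producing the same order.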
    

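    Or, if you set shuffle=True, two separate map passes may visit the elements in different orders, so images and labels can end up misaligned. In that case, collect both in a single pass over the batched dataset, for example with tf.TensorArray: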

    images = tf.TensorArray(dtype=tf.float32, size=0, dynamic_size=True)
    labels = tf.TensorArray(dtype=tf.int32, size=0, dynamic_size=True)
    
    for x, y in train_ds.unbatch():
      images = images.write(images.size(), x)
      labels = labels.write(labels.size(), y)
    
    # stack() already returns a single tensor; .numpy() converts it for non-TensorFlow code
    images = images.stack().numpy()
    labels = labels.stack().numpy()
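
    To check that a single pass keeps images and labels paired even when the dataset is shuffled, a small experiment along these lines may help. It is a sketch under stated assumptions: shuffled_ds is a made-up synthetic dataset (not the one from the question) in which every "image" is filled with its own label value, so alignment can be verified after extraction.

```python
import numpy as np
import tensorflow as tf

# Hypothetical data: image i is a (2, 2, 3) array filled with the value i,
# and its label is i, so any pairing mistake is easy to detect.
n = 20
fake_labels = tf.range(n, dtype=tf.int32)
fake_images = tf.cast(tf.reshape(fake_labels, (n, 1, 1, 1)), tf.float32) * tf.ones((n, 2, 2, 3))
shuffled_ds = tf.data.Dataset.from_tensor_slices((fake_images, fake_labels)).shuffle(n).batch(8)

# Single pass: each element is an (image, label) pair of NumPy values
pairs = list(shuffled_ds.unbatch().as_numpy_iterator())
images = np.array([x for x, _ in pairs])
labels = np.array([y for _, y in pairs])

# Every image should still carry the value of its own label
aligned = bool(np.all(images[:, 0, 0, 0] == labels))
print(images.shape, labels.shape, aligned)
```

    The order of the extracted elements is random here, but each image stays next to its label because both come from the same iteration.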