Tags: python, tensorflow, tensorflow-datasets

How to extract data without labels from a TensorFlow dataset


I have a tf dataset called train_ds:

import numpy as np
import tensorflow as tf

directory = 'Data/dataset_train'

train_ds = tf.keras.preprocessing.image_dataset_from_directory(
  directory,
  validation_split=0.2,
  subset="training",
  color_mode='grayscale',
  seed=123,
  image_size=(28, 28),
  batch_size=32)

The dataset consists of 20000 "Fake" images and 20000 "Real" images. I want to extract X_train and y_train from it as NumPy arrays, but so far I have only managed to get the labels out with:

y_train = np.concatenate([y for x, y in train_ds], axis=0)

I also tried this, but it doesn't seem to iterate through all 20000 images:

for images, labels in train_ds.take(-1):  
    X_train = images.numpy()
    y_train = labels.numpy()

I really want to extract the images to X_train and the labels to y_train but I can't figure it out! I apologize in advance for any mistake I've made and appreciate all the help I can get :)


Solution

  • If you have not applied any further transformations to the dataset, it is a BatchDataset, so iterating over it yields batches of images and labels rather than single examples. You can collect those batches in two lists. In my case there are 2936 images in total.

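    x_train, y_train = [], []

    # Each iteration yields one batch of images with its matching labels
    for images, labels in train_ds:
      x_train.append(images.numpy())
      y_train.append(labels.numpy())

    len(x_train)  # 92 batches (2936 images at batch_size=32; the last batch is partial)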
    

    Each list element is one batch, not one image. Use np.concatenate to join the batches along the first axis:

    x_train = np.concatenate(x_train, axis=0)
    y_train = np.concatenate(y_train, axis=0)
    x_train.shape  # (2936, 28, 28, 3); the last dimension is 1 with color_mode='grayscale'
    

    Alternatively, unbatch the dataset and iterate over individual examples:

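    x_train, y_train = [], []  # start again from empty lists

    # After unbatch(), each iteration yields a single image and its label
    for image, label in train_ds.unbatch():
      x_train.append(image.numpy())
      y_train.append(label.numpy())

    x_train = np.array(x_train)
    y_train = np.array(y_train)
    x_train.shape  # (2936, 28, 28, 3)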
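
For anyone who wants to try the batch-and-concatenate approach without the image folder, here is a minimal sketch on synthetic data. The name `demo_ds` and the generated arrays are hypothetical stand-ins for `train_ds`, not part of the original question. Each fake image is filled with its own index, so the final check can confirm that images and labels are still paired correctly after shuffling.

```python
import numpy as np
import tensorflow as tf

# Hypothetical stand-in data: 100 grayscale 28x28 "images".
# Image i is filled with the value i, and its label is i % 2.
n = 100
images = np.ones((n, 28, 28, 1), dtype=np.float32)
images *= np.arange(n, dtype=np.float32).reshape(n, 1, 1, 1)
labels = (np.arange(n) % 2).astype(np.int32)

# Shuffled, batched dataset, similar in structure to what
# image_dataset_from_directory returns (it shuffles by default).
demo_ds = (
    tf.data.Dataset.from_tensor_slices((images, labels))
    .shuffle(buffer_size=n)
    .batch(32)
)

# Collect images and labels in the same pass so they stay aligned.
x_batches, y_batches = [], []
for x, y in demo_ds:
    x_batches.append(x.numpy())
    y_batches.append(y.numpy())

# Join the batches along the first (example) axis.
X_train = np.concatenate(x_batches, axis=0)
y_train = np.concatenate(y_batches, axis=0)

print(len(x_batches))  # 4 batches: 32 + 32 + 32 + 4
print(X_train.shape)   # (100, 28, 28, 1)
print(y_train.shape)   # (100,)

# Each image still matches its label, whatever the shuffle order was.
recovered = X_train[:, 0, 0, 0].astype(np.int32) % 2
print(np.array_equal(recovered, y_train))  # True
```

Reading both tensors in one loop matters here: extracting images in one pass and labels in another can pair images with the wrong labels, because a shuffled dataset may produce a different order on each iteration.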