
How to split data into x_train and y_train


I'm trying to access EMNIST data from here:

https://www.tensorflow.org/datasets/splits

with this code:

train_ds, test_ds = tfds.load('emnist', split=['train', 'test'], shuffle_files=True)

I tried doing this:

x_train = train_ds['image']
y_train = train_ds['label']
x_test = test_ds['image']
y_test = test_ds['label']

But I get the error TypeError: 'PrefetchDataset' object is not subscriptable

When I try to print train_ds it prints

<PrefetchDataset element_spec={'image': TensorSpec(shape=(28, 28, 1), dtype=tf.uint8, name=None), 'label': TensorSpec(shape=(), dtype=tf.int64, name=None)}>

I want to separate the images and the labels into x_train, y_train, x_test, y_test, the way you would with the MNIST data from Keras.

I see from here: https://www.tensorflow.org/datasets/catalog/emnist that the structure for the feature is

FeaturesDict({
    'image': Image(shape=(28, 28, 1), dtype=uint8),
    'label': ClassLabel(shape=(), dtype=int64, num_classes=47),
})

But I'm not sure how to extract them :C


Solution

  • If you just want to separate the images from the labels but keep them as tf.data.Dataset objects, you could run the following (recommended):

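    import tensorflow as tf
    import tensorflow_datasets as tfds

    # shuffle_files=False keeps the read order deterministic. With
    # shuffle_files=True, separate passes over train_ds may visit the examples
    # in different orders, so the image and label datasets could get out of step.
    train_ds, test_ds = tfds.load('emnist', split=['train', 'test'], shuffle_files=False)

    # Each element is a dict, so pick out one feature per dataset.
    x_train = train_ds.map(lambda ex: ex['image'])
    y_train = train_ds.map(lambda ex: ex['label'])
    x_test = test_ds.map(lambda ex: ex['image'])
    y_test = test_ds.map(lambda ex: ex['label'])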
    

    You could also convert the datasets to NumPy arrays, but this can take a while (about 6 minutes on Colab):

    import numpy as np
    
    x_train = np.array(list(train_ds.map(lambda i: i['image'])))
    y_train = np.array(list(train_ds.map(lambda l: l['label'])))
    x_test = np.array(list(test_ds.map(lambda x: x['image'])))
    y_test = np.array(list(test_ds.map(lambda y: y['label'])))
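
Another option is to walk each split once and collect images and labels in the same pass, which keeps every image paired with its label regardless of how the files were shuffled. The sketch below is only an illustration: `to_xy` is a hypothetical helper (not part of TensorFlow Datasets), and the synthetic `fake_examples` list is an assumed stand-in for the feature dicts that `tfds.as_numpy(train_ds)` would yield.

```python
import numpy as np


def to_xy(examples, image_key="image", label_key="label"):
    """Collect an iterable of feature dicts into (x, y) NumPy arrays.

    Hypothetical helper: a single pass over the examples keeps each
    image paired with its own label.
    """
    images, labels = [], []
    for example in examples:
        images.append(example[image_key])
        labels.append(example[label_key])
    return np.stack(images), np.array(labels)


# Synthetic stand-in for tfds.as_numpy(train_ds): three fake 28x28x1 images
# whose pixel values all equal their label.
fake_examples = [
    {"image": np.full((28, 28, 1), i, dtype=np.uint8), "label": np.int64(i)}
    for i in range(3)
]

x_train, y_train = to_xy(fake_examples)
print(x_train.shape, y_train.shape)  # (3, 28, 28, 1) (3,)
```

With the real data this would presumably be called as `x_train, y_train = to_xy(tfds.as_numpy(train_ds))`, and likewise for `test_ds`. If the `(N, 28, 28)` layout of the Keras MNIST arrays is wanted, `np.squeeze(x_train, axis=-1)` drops the trailing channel axis.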