
Split train data to train and validation by using tensorflow_datasets.load (TF 2.1)


I'm trying to run a Colab project, but when I try to split the training data into training and validation parts I get this error:

KeyError: "Invalid split train[:70%]. Available splits are: ['train']"

I use the following code:

(training_set, validation_set), dataset_info = tfds.load(
    'tf_flowers',
    split=['train[:70%]', 'train[70%:]'],
    with_info=True,
    as_supervised=True,
)

How can I fix this error?


Solution

  • According to the TensorFlow Datasets docs, the approach you presented is now supported. You can split a dataset by passing a slice expression as the split parameter of tfds.load, for example split="train[:70%]".

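    import tensorflow_datasets as tfds

    # First 70% of 'train' for training, remaining 30% for validation.
    (training_set, validation_set), dataset_info = tfds.load(
        'tf_flowers',
        split=['train[:70%]', 'train[70%:]'],
        with_info=True,       # also return the DatasetInfo object
        as_supervised=True,   # yield (image, label) tuples
    )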
    

    With the above code the training_set has 2569 entries, while validation_set has 1101.
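The sizes quoted above can be checked by hand. The sketch below is only an illustration: the helper names `percent_split_specs` and `percent_split_sizes` are hypothetical (they are not part of tensorflow_datasets), the total of 3670 examples is simply 2569 + 1101 from the answer, and rounding the boundary to the nearest integer is an assumption that may differ from what TFDS does when the boundary is not a whole number.

```python
def percent_split_specs(train_percent, split_name="train"):
    """Build the two split strings for a train/validation cut (hypothetical helper)."""
    return [
        f"{split_name}[:{train_percent}%]",
        f"{split_name}[{train_percent}%:]",
    ]


def percent_split_sizes(num_examples, train_percent):
    """Return (train_size, validation_size) for a cut at train_percent.

    Assumption: the boundary index is rounded to the nearest integer.
    """
    boundary = round(num_examples * train_percent / 100)
    return boundary, num_examples - boundary


print(percent_split_specs(70))        # ['train[:70%]', 'train[70%:]']
print(percent_split_sizes(3670, 70))  # (2569, 1101)
```

The list returned by `percent_split_specs` has the same form as the `split` argument passed to tfds.load in the answer.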

    Thanks to Saman for the comment on API deprecation.
    In previous versions it was possible to split with the tfds.Split subsplit API, which is now deprecated:

    (training_set, validation_set), dataset_info = tfds.load(
        'tf_flowers',
        split=[
            tfds.Split.TRAIN.subsplit(tfds.percent[:70]),
            tfds.Split.TRAIN.subsplit(tfds.percent[70:])
        ],
        with_info=True,
        as_supervised=True,
    )
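
If the KeyError from the question still appears, one possible cause is an installed tensorflow-datasets release that predates the string slicing syntax. That is an assumption, not something the answer confirms. A minimal sketch for checking which release is installed, using only the standard library (the helper name `installed_version` is hypothetical):

```python
from importlib import metadata


def installed_version(package="tensorflow-datasets"):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("tensorflow-datasets is not installed")
else:
    print(f"tensorflow-datasets {version} is installed")
```

If the reported release is old, upgrading with `pip install --upgrade tensorflow-datasets` (prefixed with `!` in a Colab cell) and restarting the runtime is worth trying before switching to the deprecated API.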