
train, test, validation splits in tfds.load


I have been asked to fill in the split parameter of tfds.load so that the data is divided into 80% train, 10% validation, and 10% test, and I do not understand how to do that here. Please help. Thanks.

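import matplotlib.pyplot as plt
import tensorflow_datasets as tfds

def plot_example(x_raw, y_raw):
  # Show the first nine images in a 3x3 grid, each titled with its label.
  fig, axes = plt.subplots(3, 3)
  for i in range(3):
    for j in range(3):
      axes[i,j].imshow(x_raw[i*3 + j], cmap = 'bone')
      axes[i,j].set_title(y_raw[i*3 + j])
      # Hide the axis ticks so only the image and its title are visible.
      axes[i,j].get_yaxis().set_visible(False)
      axes[i,j].get_xaxis().set_visible(False)
  fig.set_size_inches(18.5, 10.5, forward=True)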

## TODO: Implement the split function parameter: 80% train, 10% validation, and 10% test.
(ds_train, ds_val, ds_test), ds_info = tfds.load("colorectal_histology", 
                                           split=[],
                                           as_supervised=True, with_info=True)
df = tfds.as_dataframe(ds_train.shuffle(1000).take(1000), ds_info)

plot_example(df['image'], df['label'])
print(ds_info)

Please explain


Solution

  • tfds.load takes a split argument, which accepts a list of slice strings and returns one dataset per entry, in the same order. For 80% train, 10% validation, and 10% test, you can write

    tfds.load(
        "colorectal_histology",
        split=["train[20%:]", "train[0%:10%]", "train[10%:20%]"],
        as_supervised=True,
        with_info=True)
    

    Here the first entry in split, train[20%:], returns the last 80% of the data as the training set, train[0%:10%] returns the first 10% as the validation set, and train[10%:20%] returns the next 10% as the test set. The three slices do not overlap. If a dataset provides its own test split you can use that in full instead, but to get an 80/10/10 split out of train alone, this is how to do it.

    The slicing syntax is described in the TFDS documentation on splits.
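
To see why the three slices above add up to 80/10/10 without overlapping, here is a minimal plain-Python sketch of the same idea. It does not call TFDS and is not how TFDS is implemented internally; the helper name percent_slices and the dataset size of 5000 are assumptions made only for illustration.

```python
def percent_slices(num_examples, boundaries=(0, 10, 20, 100)):
    """Turn percent boundaries into (start, stop) index pairs.

    Hypothetical helper for illustration; TFDS does its own rounding.
    """
    cuts = [round(num_examples * p / 100) for p in boundaries]
    return list(zip(cuts[:-1], cuts[1:]))


# Stand-in for a dataset: 5000 example ids (an assumed size).
examples = list(range(5000))

# Same layout as the answer: [0%:10%] -> val, [10%:20%] -> test, [20%:] -> train.
(val_lo, val_hi), (test_lo, test_hi), (train_lo, train_hi) = percent_slices(len(examples))

val = examples[val_lo:val_hi]
test = examples[test_lo:test_hi]
train = examples[train_lo:train_hi]

print(len(train), len(val), len(test))  # 4000 500 500
```

Because each slice starts where the previous one stops, every example lands in exactly one of the three sets.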