Suppose you have a TensorFlow dataset that has values and labels. In my case I created it from a time series as follows:
import pandas as pd
import numpy as np
import tensorflow as tf

df = pd.read_csv('MY.csv', index_col=0, parse_dates=True)
# extract the column we are interested in
single_col_df = df[['Close']]
# convert to a tf.data.Dataset of sliding windows
WINDOW_SIZE = 10
dataset = tf.data.Dataset.from_tensor_slices(single_col_df.values)
d = dataset.window(WINDOW_SIZE + 1, shift=1, drop_remainder=True)
d2 = d.flat_map(lambda window: window.batch(WINDOW_SIZE + 1))
# create data and ground truth
d3 = d2.map(lambda window: (window[:-1], window[-1:]))
#get the total data and shuffle
len_ds = 0
for item in d2:
    len_ds += 1
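# note: d2.cardinality() typically reports an unknown cardinality after
# flat_map, so the count has to be materialized by iterating; assuming
# eager execution, an equivalent one-liner would be:
#   len_ds = int(d2.reduce(0, lambda count, _: count + 1).numpy())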
d_shuffled = d3.shuffle(buffer_size=len_ds)
# split train/test
train_size = int(0.7 * len_ds)
val_size = int(0.15 * len_ds)
test_size = int(0.15 * len_ds)
train_dataset = d_shuffled.take(train_size)
test_dataset = d_shuffled.skip(train_size)
val_dataset = test_dataset.skip(test_size)
test_dataset = test_dataset.take(test_size)
train_dataset = train_dataset.batch(32).prefetch(2)
val_dataset = val_dataset.batch(32)
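To make the windowing concrete, here is a minimal sketch of what the pipeline above produces, run on a toy integer series instead of the Close prices:

import tensorflow as tf

# toy series 0..9, windows of 3 inputs + 1 label (i.e. WINDOW_SIZE = 3)
toy = tf.data.Dataset.from_tensor_slices(tf.range(10))
toy = toy.window(4, shift=1, drop_remainder=True)
toy = toy.flat_map(lambda w: w.batch(4))
toy = toy.map(lambda w: (w[:-1], w[-1:]))
for x, y in toy.take(2):
    print(x.numpy(), y.numpy())
# [0 1 2] [3]
# [1 2 3] [4]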
Now, for evaluation purposes, I want to get the ground-truth values of the test set, so I am running:
y = np.concatenate([y for x, y in test_dataset], axis=0)
but each time this returns the array in a different order, so it cannot be compared with the values predicted by the model. For example, when running the above line in a Jupyter notebook and printing the first 5 values of y with `y[:5]`, one time I get:
array([[26.04000092],
[16.39999962],
[18.98999977],
[42.31000137],
[19.82999992]])
and another time I get:
array([[15.86999989],
[43.27999878],
[19.32999992],
[48.38000107],
[17.12000084]])
but the length of y remains the same, so I am assuming that the elements are just shuffled around. Either way, I cannot compare these values with the predicted ones, since their order is different:
y_hat = model.predict(test_dataset)
Furthermore, I also get different evaluation results. For example:
x = []
y = []
for _x, _y in test_dataset:
    x.append(_x)
    y.append(_y)
x = np.array(x)
y = np.array(y)
model.evaluate(x=x, y=y)
each time the loop defining the arrays x and y is re-executed, I get different x and y arrays, which lead to a different evaluation result.
By calling shuffle on the whole dataset before splitting it, you actually reshuffle the dataset after each exhaustion. Here is what is happening:
The first call to
y = np.concatenate([y for x, y in test_dataset], axis=0)
exhausts the test dataset. The second call sees that test_dataset is exhausted, and triggers a reshuffle of the whole dataset. You end up with samples that belonged to the train dataset during the first exhaustion potentially landing in the test dataset during the second round.
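You can observe this directly with a small snippet (a sketch; the concrete values will vary from run to run):

import tensorflow as tf

ds = tf.data.Dataset.range(10).shuffle(10)  # reshuffle_each_iteration=True by default
ds_test = ds.skip(7)
first_pass = {int(x) for x in ds_test}
second_pass = {int(x) for x in ds_test}  # second iteration triggers a reshuffle
print(first_pass, second_pass)  # typically two different sets of elements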
If we look at the documentation of tf.data.Dataset.shuffle:
reshuffle_each_iteration (Optional.) A boolean, which if true indicates that the dataset should be pseudorandomly reshuffled each time it is iterated over. (Defaults to True.)
Set it to False to get a deterministic shuffle. If you still want to shuffle your training set each epoch, you need to call shuffle on the train set itself:
import tensorflow as tf

tf.random.set_seed(0)  # reproducibility
a = tf.range(10)
ds = tf.data.Dataset.from_tensor_slices(a)
# shuffle once, with a deterministic order on every iteration
ds_shuffled = ds.shuffle(10, reshuffle_each_iteration=False)
ds_train = ds_shuffled.take(7)
# reshuffle only the train set each time it is iterated over
ds_train = ds_train.shuffle(7)
ds_test = ds_shuffled.skip(7)
Running it:
>>> [x.numpy() for x in ds_test]
[5, 8, 4]
>>> [x.numpy() for x in ds_test]
[5, 8, 4]
>>> [x.numpy() for x in ds_train]
[1, 3, 7, 2, 6, 9, 0]
>>> [x.numpy() for x in ds_train]
[3, 9, 6, 7, 2, 1, 0]
Try running it with reshuffle_each_iteration=True to reproduce what happened in your own code.
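Applied to your pipeline, the fix would look something like this (a sketch reusing your d3 and len_ds from above; I also batch the test set so it can be passed directly to predict and evaluate):

d_shuffled = d3.shuffle(buffer_size=len_ds, reshuffle_each_iteration=False)

train_size = int(0.7 * len_ds)
test_size = int(0.15 * len_ds)

train_dataset = d_shuffled.take(train_size)
rest = d_shuffled.skip(train_size)
val_dataset = rest.skip(test_size)
test_dataset = rest.take(test_size)

# reshuffle only the training data each epoch
train_dataset = train_dataset.shuffle(train_size).batch(32).prefetch(2)
val_dataset = val_dataset.batch(32)
test_dataset = test_dataset.batch(32)

With this, y = np.concatenate([y for x, y in test_dataset], axis=0) returns the same array on every run and can be compared against model.predict(test_dataset).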