I want more control over TensorFlow dataset generation. For this reason, I want to mirror the behavior of timeseries_dataset_from_array, but with the ability to use either consecutive windows or non-overlapping windows (timeseries_dataset_from_array does not allow setting sequence_stride=0).
# df_with_inputs has shape (x, 19); df_with_labels has shape (x, 1)
ds = tf.data.Dataset.from_tensor_slices((df_with_inputs.values, df_with_labels.values)).window(20, shift=1, stride=1, drop_remainder=True).batch(32)
which should be equivalent to:
ds = tf.keras.preprocessing.timeseries_dataset_from_array(df_with_inputs[df_with_inputs.columns], df_with_labels[df_with_labels.columns], sequence_length=window_size, sequence_stride=1, shuffle=False, batch_size=batch_size)
Both create a BatchDataset with the same number of samples, but the type spec of the manually built dataset is different. The first gives me:
<BatchDataset shapes: (DatasetSpec(TensorSpec(shape=(19,), dtype=tf.float32, name=None), TensorShape([None])), DatasetSpec(TensorSpec(shape=(1,), dtype=tf.float32, name=None), TensorShape([None]))), types: (DatasetSpec(TensorSpec(shape=(19,), dtype=tf.float32, name=None), TensorShape([None])), DatasetSpec(TensorSpec(shape=(1,), dtype=tf.float32, name=None), TensorShape([None])))>
whereas the second gives me:
<BatchDataset shapes: ((None, None, 19), (None, 1)), types: (tf.float64, tf.int32)>
Both contain the same number of elements, in my case 3063. Note that stride and sequence_stride behave differently in the two methods (to get the same behavior, you need shift=1). Additionally, when I try to feed the first dataset to my NN, I get the following error (whereas the dataset from timeseries_dataset_from_array works like a charm):
TypeError: Inputs to a layer should be tensors.
Any idea what I am missing here?
My model:
input_shape = (window_size, num_features)  # (20, 19)
model = tf.keras.Sequential([
    tf.keras.layers.Conv1D(filters=64, kernel_size=3, activation='relu', padding="same",
                           input_shape=input_shape), [....]])
The equivalent of this:
import tensorflow as tf
tf.random.set_seed(345)
samples = 30
df_with_inputs = tf.random.normal((samples, 2), dtype=tf.float32)
df_with_labels = tf.random.uniform((samples, 1), maxval=2, dtype=tf.int32)
batch_size = 2
window_size = 20
ds1 = tf.keras.preprocessing.timeseries_dataset_from_array(df_with_inputs, df_with_labels, sequence_length=window_size,sequence_stride=1,shuffle=False, batch_size=batch_size)
for x, y in ds1.take(1):
    print(x, y)
tf.Tensor(
[[[-0.01898661 1.2348452 ]
[-0.33379436 -0.13637085]
[-2.239644 1.5407541 ]
[-0.14988706 0.50577176]
[-1.6328571 -0.9512018 ]
[-3.0481005 0.8019097 ]
[-0.683125 -0.12166552]
[-0.5408724 -0.97584397]
[ 0.47595206 1.0512688 ]
[ 0.15297593 0.7393363 ]
[-0.17052855 -0.12541457]
[ 1.1617764 -2.491248 ]
[-2.5665069 0.9241422 ]
[ 0.40681016 -1.031384 ]
[-0.23945935 1.5275828 ]
[-1.3431666 0.2940185 ]
[ 1.7351524 0.34276873]
[ 0.8059861 2.0647929 ]
[-0.3017126 0.729208 ]
[-0.8672192 -0.79938954]]
[[-0.33379436 -0.13637085]
[-2.239644 1.5407541 ]
[-0.14988706 0.50577176]
[-1.6328571 -0.9512018 ]
[-3.0481005 0.8019097 ]
[-0.683125 -0.12166552]
[-0.5408724 -0.97584397]
[ 0.47595206 1.0512688 ]
[ 0.15297593 0.7393363 ]
[-0.17052855 -0.12541457]
[ 1.1617764 -2.491248 ]
[-2.5665069 0.9241422 ]
[ 0.40681016 -1.031384 ]
[-0.23945935 1.5275828 ]
[-1.3431666 0.2940185 ]
[ 1.7351524 0.34276873]
[ 0.8059861 2.0647929 ]
[-0.3017126 0.729208 ]
[-0.8672192 -0.79938954]
[-0.14423785 0.95039433]]], shape=(2, 20, 2), dtype=float32) tf.Tensor(
[[1]
[1]], shape=(2, 1), dtype=int32)
Using tf.data.Dataset.from_tensor_slices would be this:
ds2 = tf.data.Dataset.from_tensor_slices((df_with_inputs, df_with_labels)).batch(batch_size)
inputs_only_ds = ds2.map(lambda x, y: x)
inputs_only_ds = inputs_only_ds.flat_map(tf.data.Dataset.from_tensor_slices).window(window_size, shift=1, stride=1, drop_remainder=True).flat_map(lambda x: x.batch(window_size)).batch(batch_size)
ds2 = tf.data.Dataset.zip((inputs_only_ds, ds2.map(lambda x, y: y)))
for x, y in ds2.take(1):
    print(x, y)
tf.Tensor(
[[[-0.01898661 1.2348452 ]
[-0.33379436 -0.13637085]
[-2.239644 1.5407541 ]
[-0.14988706 0.50577176]
[-1.6328571 -0.9512018 ]
[-3.0481005 0.8019097 ]
[-0.683125 -0.12166552]
[-0.5408724 -0.97584397]
[ 0.47595206 1.0512688 ]
[ 0.15297593 0.7393363 ]
[-0.17052855 -0.12541457]
[ 1.1617764 -2.491248 ]
[-2.5665069 0.9241422 ]
[ 0.40681016 -1.031384 ]
[-0.23945935 1.5275828 ]
[-1.3431666 0.2940185 ]
[ 1.7351524 0.34276873]
[ 0.8059861 2.0647929 ]
[-0.3017126 0.729208 ]
[-0.8672192 -0.79938954]]
[[-0.33379436 -0.13637085]
[-2.239644 1.5407541 ]
[-0.14988706 0.50577176]
[-1.6328571 -0.9512018 ]
[-3.0481005 0.8019097 ]
[-0.683125 -0.12166552]
[-0.5408724 -0.97584397]
[ 0.47595206 1.0512688 ]
[ 0.15297593 0.7393363 ]
[-0.17052855 -0.12541457]
[ 1.1617764 -2.491248 ]
[-2.5665069 0.9241422 ]
[ 0.40681016 -1.031384 ]
[-0.23945935 1.5275828 ]
[-1.3431666 0.2940185 ]
[ 1.7351524 0.34276873]
[ 0.8059861 2.0647929 ]
[-0.3017126 0.729208 ]
[-0.8672192 -0.79938954]
[-0.14423785 0.95039433]]], shape=(2, 20, 2), dtype=float32) tf.Tensor(
[[1]
[1]], shape=(2, 1), dtype=int32)
Note that the first flat_map is necessary to flatten the batched dataset back into individual rows, so that the sliding windows can be applied more easily. The call flat_map(lambda x: x.batch(window_size)) then turns each window, which window() returns as a nested dataset rather than a tensor, into a single tensor of window_size rows.
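To make the role of that second flat_map more concrete, here is a minimal sketch on a toy dataset. The range of six integers and the window size of 3 are assumptions chosen only for readability; they are not part of the original data.

```python
import tensorflow as tf

# Toy series 0..5 (illustrative only), so the window contents are easy to read.
series = tf.data.Dataset.range(6)

# window() yields one small Dataset per window, not a tensor.
nested = series.window(3, shift=1, stride=1, drop_remainder=True)
first_window = next(iter(nested))
is_nested = isinstance(first_window, tf.data.Dataset)

# flat_map + batch(window size) collapses each nested window into one tensor.
flat = nested.flat_map(lambda w: w.batch(3))
windows = [w.numpy().tolist() for w in flat]

print(is_nested)
print(windows)  # [[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5]]
```

Without the flat_map step, the elements stay nested datasets, which is consistent with the DatasetSpec entries in the type spec and with a layer complaining that its inputs are not tensors.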
With the line inputs_only_ds = ds2.map(lambda x, y: x) I extract only the data (x) without the labels (y) in order to run the sliding windows. Afterwards, with tf.data.Dataset.zip((inputs_only_ds, ds2.map(lambda x, y: y))), I zip the windowed dataset with the labels (y), which gives the final ds2.
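For the non-overlapping case mentioned in the question, the same building blocks can be reused with shift set to the window size. The following is only a sketch under stated assumptions: windowed_dataset is a hypothetical helper name, the toy inputs and labels are made up for illustration, and each window is paired with the label of its first row (the convention timeseries_dataset_from_array uses for targets).

```python
import tensorflow as tf

def windowed_dataset(inputs, labels, window_size, shift, batch_size):
    """Hypothetical helper: windows over `inputs`, each paired with the label of its first row."""
    windows = (
        tf.data.Dataset.from_tensor_slices(inputs)
        .window(window_size, shift=shift, stride=1, drop_remainder=True)
        .flat_map(lambda w: w.batch(window_size, drop_remainder=True))
    )
    # Keep every `shift`-th label so label k lines up with the window starting at row k * shift.
    targets = tf.data.Dataset.from_tensor_slices(labels[::shift])
    # Zip first, batch afterwards, so x and y always share the same batch dimension.
    return tf.data.Dataset.zip((windows, targets)).batch(batch_size)

# Toy data (illustrative only): 12 rows with 2 features, label equals the row index.
inputs = tf.reshape(tf.range(24, dtype=tf.float32), (12, 2))
labels = tf.reshape(tf.range(12), (12, 1))

# shift == window_size gives non-overlapping windows; shift=1 gives sliding windows.
ds = windowed_dataset(inputs, labels, window_size=4, shift=4, batch_size=2)
x, y = next(iter(ds))
print(x.shape, y.numpy().tolist())
```

Zipping before batching is a deliberate choice here: if the number of windows is not a multiple of batch_size, the final partial batch still has matching x and y sizes.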