
Is a TensorFlow Dataset operation equivalent to timeseries_dataset_from_array possible?


I want more control over TensorFlow dataset generation. For this reason, I want to mirror the behavior of timeseries_dataset_from_array, but with the ability to use either consecutive or non-overlapping windows (timeseries_dataset_from_array does not allow setting sequence_stride=0).
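For illustration, here is a minimal sketch (with toy data, not my actual frame) of how non-overlapping windows can be produced with tf.data by setting shift equal to the window size:

```python
import tensorflow as tf

# Toy sketch: non-overlapping windows of size 4 over 12 rows of 2 features,
# achieved by setting shift equal to the window size.
data = tf.reshape(tf.range(24, dtype=tf.float32), (12, 2))
ds = (tf.data.Dataset.from_tensor_slices(data)
      .window(4, shift=4, drop_remainder=True)   # shift == window size -> no overlap
      .flat_map(lambda w: w.batch(4)))           # turn each window back into a tensor

windows = list(ds.as_numpy_iterator())
print(len(windows))    # 3 windows, none sharing rows
print(windows[1][0])   # first row of the second window: [8. 9.]
```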

# df_with_inputs.shape = (x, 19), df_with_labels.shape = (x, 1)
ds = (tf.data.Dataset.from_tensor_slices((df_with_inputs.values, df_with_labels.values))
      .window(20, shift=1, stride=1, drop_remainder=True)
      .batch(32))

should be equivalent to:

ds = tf.keras.preprocessing.timeseries_dataset_from_array(
    df_with_inputs[df_with_inputs.columns],
    df_with_labels[df_with_labels.columns],
    sequence_length=window_size,
    sequence_stride=1,
    shuffle=False,
    batch_size=batch_size)

Both create a BatchDataset with the same number of samples, but the type spec of the manually built dataset is somehow different. The first gives me:

<BatchDataset shapes: (DatasetSpec(TensorSpec(shape=(19,), dtype=tf.float32, name=None), TensorShape([None])), DatasetSpec(TensorSpec(shape=(1,), dtype=tf.float32, name=None), TensorShape([None]))), types: (DatasetSpec(TensorSpec(shape=(19,), dtype=tf.float32, name=None), TensorShape([None])), DatasetSpec(TensorSpec(shape=(1,), dtype=tf.float32, name=None), TensorShape([None])))>

whereas the latter gives me:

<BatchDataset shapes: ((None, None, 19), (None, 1)), types: (tf.float64, tf.int32)>

Both contain the same number of elements, in my case 3063. Note that stride and sequence_stride behave differently in the two methods (for the same behavior, you need shift=1). Additionally, when I try to feed the first dataset to my NN, I receive the following error (while the dataset from timeseries_dataset_from_array works like a charm):
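To make the stride/shift difference concrete, here is a small toy sketch (my own example data, not the frame above):

```python
import tensorflow as tf

# In Dataset.window, `shift` moves the window start (like sequence_stride in
# timeseries_dataset_from_array), while `stride` subsamples *within* a window.
data = tf.data.Dataset.range(10)

start_every_2 = (data.window(4, shift=2, drop_remainder=True)
                     .flat_map(lambda w: w.batch(4)))
every_other = (data.window(4, shift=1, stride=2, drop_remainder=True)
                   .flat_map(lambda w: w.batch(4)))

print(next(iter(start_every_2)).numpy())  # [0 1 2 3]
print(next(iter(every_other)).numpy())    # [0 2 4 6]
```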

TypeError: Inputs to a layer should be tensors.

Any idea what I am missing here?

My model:

input_shape = (window_size, num_features)  # (20, 19)
model = tf.keras.Sequential([
    tf.keras.layers.Conv1D(filters=64, kernel_size=3, activation='relu', padding="same",
                           input_shape=input_shape),
    [....]])

Solution

  • The equivalent of this:

    import tensorflow as tf
    
    tf.random.set_seed(345)
    samples = 30
    df_with_inputs = tf.random.normal((samples, 2), dtype=tf.float32)
    df_with_labels = tf.random.uniform((samples, 1), maxval=2, dtype=tf.int32)
    batch_size = 2
    window_size = 20
    ds1 = tf.keras.preprocessing.timeseries_dataset_from_array(
        df_with_inputs, df_with_labels, sequence_length=window_size,
        sequence_stride=1, shuffle=False, batch_size=batch_size)
    for x, y in ds1.take(1):
      print(x, y)
    
    tf.Tensor(
    [[[-0.01898661  1.2348452 ]
      [-0.33379436 -0.13637085]
      [-2.239644    1.5407541 ]
      [-0.14988706  0.50577176]
      [-1.6328571  -0.9512018 ]
      [-3.0481005   0.8019097 ]
      [-0.683125   -0.12166552]
      [-0.5408724  -0.97584397]
      [ 0.47595206  1.0512688 ]
      [ 0.15297593  0.7393363 ]
      [-0.17052855 -0.12541457]
      [ 1.1617764  -2.491248  ]
      [-2.5665069   0.9241422 ]
      [ 0.40681016 -1.031384  ]
      [-0.23945935  1.5275828 ]
      [-1.3431666   0.2940185 ]
      [ 1.7351524   0.34276873]
      [ 0.8059861   2.0647929 ]
      [-0.3017126   0.729208  ]
      [-0.8672192  -0.79938954]]
    
     [[-0.33379436 -0.13637085]
      [-2.239644    1.5407541 ]
      [-0.14988706  0.50577176]
      [-1.6328571  -0.9512018 ]
      [-3.0481005   0.8019097 ]
      [-0.683125   -0.12166552]
      [-0.5408724  -0.97584397]
      [ 0.47595206  1.0512688 ]
      [ 0.15297593  0.7393363 ]
      [-0.17052855 -0.12541457]
      [ 1.1617764  -2.491248  ]
      [-2.5665069   0.9241422 ]
      [ 0.40681016 -1.031384  ]
      [-0.23945935  1.5275828 ]
      [-1.3431666   0.2940185 ]
      [ 1.7351524   0.34276873]
      [ 0.8059861   2.0647929 ]
      [-0.3017126   0.729208  ]
      [-0.8672192  -0.79938954]
      [-0.14423785  0.95039433]]], shape=(2, 20, 2), dtype=float32) tf.Tensor(
    [[1]
     [1]], shape=(2, 1), dtype=int32)
    

    Using tf.data.Dataset.from_tensor_slices would be this:

    ds2 = tf.data.Dataset.from_tensor_slices((df_with_inputs, df_with_labels)).batch(batch_size)
    inputs_only_ds = ds2.map(lambda x, y: x)
    inputs_only_ds = (inputs_only_ds
                      .flat_map(tf.data.Dataset.from_tensor_slices)
                      .window(window_size, shift=1, stride=1, drop_remainder=True)
                      .flat_map(lambda x: x.batch(window_size))
                      .batch(batch_size))
    ds2 = tf.data.Dataset.zip((inputs_only_ds, ds2.map(lambda x, y: y)))
    for x, y in ds2.take(1):
      print(x, y)
    
    tf.Tensor(
    [[[-0.01898661  1.2348452 ]
      [-0.33379436 -0.13637085]
      [-2.239644    1.5407541 ]
      [-0.14988706  0.50577176]
      [-1.6328571  -0.9512018 ]
      [-3.0481005   0.8019097 ]
      [-0.683125   -0.12166552]
      [-0.5408724  -0.97584397]
      [ 0.47595206  1.0512688 ]
      [ 0.15297593  0.7393363 ]
      [-0.17052855 -0.12541457]
      [ 1.1617764  -2.491248  ]
      [-2.5665069   0.9241422 ]
      [ 0.40681016 -1.031384  ]
      [-0.23945935  1.5275828 ]
      [-1.3431666   0.2940185 ]
      [ 1.7351524   0.34276873]
      [ 0.8059861   2.0647929 ]
      [-0.3017126   0.729208  ]
      [-0.8672192  -0.79938954]]
    
     [[-0.33379436 -0.13637085]
      [-2.239644    1.5407541 ]
      [-0.14988706  0.50577176]
      [-1.6328571  -0.9512018 ]
      [-3.0481005   0.8019097 ]
      [-0.683125   -0.12166552]
      [-0.5408724  -0.97584397]
      [ 0.47595206  1.0512688 ]
      [ 0.15297593  0.7393363 ]
      [-0.17052855 -0.12541457]
      [ 1.1617764  -2.491248  ]
      [-2.5665069   0.9241422 ]
      [ 0.40681016 -1.031384  ]
      [-0.23945935  1.5275828 ]
      [-1.3431666   0.2940185 ]
      [ 1.7351524   0.34276873]
      [ 0.8059861   2.0647929 ]
      [-0.3017126   0.729208  ]
      [-0.8672192  -0.79938954]
      [-0.14423785  0.95039433]]], shape=(2, 20, 2), dtype=float32) tf.Tensor(
    [[1]
     [1]], shape=(2, 1), dtype=int32)
    

    Note that flat_map is necessary to flatten the batched dataset back into individual rows so that sliding windows can be applied. The call flat_map(lambda x: x.batch(window_size)) then turns each window (which is itself a small dataset) into a single tensor.
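A tiny toy sketch of that nesting: window() yields datasets of elements (which is exactly the DatasetSpec visible in the question's type spec), and batching each sub-dataset by the window size turns it back into a plain tensor:

```python
import tensorflow as tf

# window() produces nested datasets; batching each window flattens it.
nested = tf.data.Dataset.range(6).window(3, shift=3)
print(nested.element_spec)   # DatasetSpec(...): each element is itself a dataset

flat = nested.flat_map(lambda w: w.batch(3))
print(flat.element_spec)     # TensorSpec(shape=(None,), dtype=tf.int64, ...)
print([t.numpy().tolist() for t in flat])  # [[0, 1, 2], [3, 4, 5]]
```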

    With the line inputs_only_ds = ds2.map(lambda x, y: x), I extract only the inputs (x), without the labels (y), to run the sliding windows on. Afterwards, in tf.data.Dataset.zip((inputs_only_ds, ds2.map(lambda x, y: y))), I zip the windowed inputs back together with the labels (y), yielding the final result ds2.
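Putting it together, here is a self-contained sketch (toy data and a hypothetical minimal model, with 2 features instead of the question's 19) showing that the zipped dataset has a plain (inputs, labels) element spec and can be fed to Keras directly, avoiding the "Inputs to a layer should be tensors" error:

```python
import tensorflow as tf

# Toy stand-ins for the question's data.
samples, features, window_size, batch_size = 30, 2, 20, 2
inputs = tf.random.normal((samples, features))
labels = tf.random.uniform((samples, 1), maxval=2, dtype=tf.int32)

# drop_remainder=True on both batch() calls keeps input and label batch
# sizes aligned for the zip below.
windows = (tf.data.Dataset.from_tensor_slices(inputs)
           .window(window_size, shift=1, drop_remainder=True)
           .flat_map(lambda w: w.batch(window_size))
           .batch(batch_size, drop_remainder=True))
label_batches = (tf.data.Dataset.from_tensor_slices(labels)
                 .batch(batch_size, drop_remainder=True))
ds = tf.data.Dataset.zip((windows, label_batches))

# element_spec is now a plain (TensorSpec, TensorSpec) tuple, so a Conv1D
# model can consume the dataset directly.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(window_size, features)),
    tf.keras.layers.Conv1D(8, 3, activation='relu', padding='same'),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit(ds, epochs=1, verbose=0)
```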