Tags: python, tensorflow, deep-learning, time-series, tensorflow-datasets

Creating a TensorFlow Dataset for multiple time series


I have data containing multiple time series that looks something like this:

import numpy as np
import pandas as pd
import tensorflow as tf

df = pd.DataFrame({'Time': np.tile(np.arange(5), 2),
                   'Object': np.concatenate([[i] * 5 for i in [1, 2]]),
                   'Feature1': np.random.randint(10, size=10),
                   'Feature2': np.random.randint(10, size=10)})

   Time  Object  Feature1  Feature2
0     0       1         3         3
1     1       1         9         2
2     2       1         6         6
3     3       1         4         0
4     4       1         7         7
5     0       2         4         8
6     1       2         3         7
7     2       2         1         1
8     3       2         7         5
9     4       2         1         7

where each object (1 and 2) has its own data (the real data has about 2000 objects). I would like to feed this data chunk-wise into an RNN/LSTM using tf.data.Dataset.window, in such a way that data from different objects never ends up in the same window. The following example shows the problem:

dataset = tf.data.Dataset.from_tensor_slices(df)

for w in dataset.window(3, shift=1, drop_remainder=True):
  print(list(w.as_numpy_iterator()))

Output:

[array([0, 1, 3, 3]), array([1, 1, 9, 2]), array([2, 1, 6, 6])]
[array([1, 1, 9, 2]), array([2, 1, 6, 6]), array([3, 1, 4, 0])]
[array([2, 1, 6, 6]), array([3, 1, 4, 0]), array([4, 1, 7, 7])]
[array([3, 1, 4, 0]), array([4, 1, 7, 7]), array([0, 2, 4, 8])] # Mixed data from both objects
[array([4, 1, 7, 7]), array([0, 2, 4, 8]), array([1, 2, 3, 7])] # Mixed data from both objects
[array([0, 2, 4, 8]), array([1, 2, 3, 7]), array([2, 2, 1, 1])]
[array([1, 2, 3, 7]), array([2, 2, 1, 1]), array([3, 2, 7, 5])]
[array([2, 2, 1, 1]), array([3, 2, 7, 5]), array([4, 2, 1, 7])]

Expected output:

[array([0, 1, 3, 3]), array([1, 1, 9, 2]), array([2, 1, 6, 6])]
[array([1, 1, 9, 2]), array([2, 1, 6, 6]), array([3, 1, 4, 0])]
[array([2, 1, 6, 6]), array([3, 1, 4, 0]), array([4, 1, 7, 7])]
[array([0, 2, 4, 8]), array([1, 2, 3, 7]), array([2, 2, 1, 1])]
[array([1, 2, 3, 7]), array([2, 2, 1, 1]), array([3, 2, 7, 5])]
[array([2, 2, 1, 1]), array([3, 2, 7, 5]), array([4, 2, 1, 7])]

Maybe there is another way to do it. The main requirement is that the windows are not mixed and that my model can tell which object each window comes from (maybe via an embedding).


Solution

  • One option is to split the data into a separate dataframe per object, window each one, and concatenate the results. That way, no window contains rows from more than one object:

    import tensorflow as tf
    import pandas as pd
    import numpy as np
    
    
    df = pd.DataFrame({'Time': np.tile(np.arange(5), 2),
                       'Object': np.concatenate([[i] * 5 for i in [1, 2]]),
                       'Feature1': np.random.randint(10, size=10),
                       'Feature2': np.random.randint(10, size=10)})
    
    df1 = df[df['Object'] == 1]
    df2 = df[df['Object'] == 2]
    
    ds1 = tf.data.Dataset.from_tensor_slices(df1).window(3, shift=1, drop_remainder=True)
    ds2 = tf.data.Dataset.from_tensor_slices(df2).window(3, shift=1, drop_remainder=True)
    dataset = ds1.concatenate(ds2)
    
    for w in dataset:
      print(list(w.as_numpy_iterator()))
    
    [array([0, 1, 3, 3]), array([1, 1, 9, 2]), array([2, 1, 6, 6])]
    [array([1, 1, 9, 2]), array([2, 1, 6, 6]), array([3, 1, 4, 0])]
    [array([2, 1, 6, 6]), array([3, 1, 4, 0]), array([4, 1, 7, 7])]
    [array([0, 2, 4, 8]), array([1, 2, 3, 7]), array([2, 2, 1, 1])]
    [array([1, 2, 3, 7]), array([2, 2, 1, 1]), array([3, 2, 7, 5])]
    [array([2, 2, 1, 1]), array([3, 2, 7, 5]), array([4, 2, 1, 7])]
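
With about 2000 objects, writing one dataframe per object by hand is not practical. A minimal sketch of the same idea using pandas groupby follows. The helper name windows_for and the fixed feature values are assumptions made for illustration, and each window is batched into a single tensor of shape (3, 4) rather than left as a nested dataset:

```python
import numpy as np
import pandas as pd
import tensorflow as tf

# Fixed values instead of random ones, so the result is reproducible.
df = pd.DataFrame({'Time': np.tile(np.arange(5), 2),
                   'Object': np.repeat([1, 2], 5),
                   'Feature1': [3, 9, 6, 4, 7, 4, 3, 1, 7, 1],
                   'Feature2': [3, 2, 6, 0, 7, 8, 7, 1, 5, 7]})

def windows_for(group, size=3, shift=1):
    """Hypothetical helper: window the rows of one object only."""
    ds = tf.data.Dataset.from_tensor_slices(group.to_numpy())
    ds = ds.window(size, shift=shift, drop_remainder=True)
    # Each window is itself a dataset; batch it into one (size, n_cols) tensor.
    return ds.flat_map(lambda w: w.batch(size, drop_remainder=True))

# One windowed dataset per object, in order of first appearance.
per_object = [windows_for(g) for _, g in df.groupby('Object', sort=False)]

# Chain the per-object datasets one after another.
dataset = per_object[0]
for ds in per_object[1:]:
    dataset = dataset.concatenate(ds)

windows = np.stack(list(dataset.as_numpy_iterator()))
print(windows.shape)
```

Chaining thousands of concatenate calls builds a long pipeline, so for the full data it may be worth measuring this against an approach that builds the windows before creating the dataset.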
    

    Update 1:

    Another approach would be to use tf.data.Dataset.filter like this:

    import tensorflow as tf
    import pandas as pd
    import numpy as np
    
    df = pd.DataFrame({'Time': np.tile(np.arange(5), 2),
                       'Object': np.concatenate([[i] * 5 for i in [1, 2]]),
                       'Feature1': np.random.randint(10, size=10),
                       'Feature2': np.random.randint(10, size=10)})
    
    objects = df['Object'].unique()
    dataset = tf.data.Dataset.from_tensor_slices(df)
    new_dataset = None
    
    for o in objects:
      # Keep only the rows of object `o` (column 1 is 'Object'), then window them.
      temp_dataset = dataset.filter(lambda x: tf.math.equal(x[1], tf.constant(o))).window(3, shift=1, drop_remainder=True)
      if new_dataset is None:
        new_dataset = temp_dataset
      else:
        new_dataset = new_dataset.concatenate(temp_dataset)
    
    for w in new_dataset:
      print(list(w.as_numpy_iterator()))
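
Note that each filter call above scans the whole dataset again, so the work grows with the number of objects. If the data fits in memory, a different technique is to build the windows in NumPy before creating the dataset. This is a sketch of that alternative, not part of the tf.data approach above. It assumes NumPy 1.20 or newer for sliding_window_view, and object_windows is a hypothetical helper name:

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

# Fixed values instead of random ones, so the result is reproducible.
df = pd.DataFrame({'Time': np.tile(np.arange(5), 2),
                   'Object': np.repeat([1, 2], 5),
                   'Feature1': [3, 9, 6, 4, 7, 4, 3, 1, 7, 1],
                   'Feature2': [3, 2, 6, 0, 7, 8, 7, 1, 5, 7]})

def object_windows(frame, size=3, shift=1):
    """Hypothetical helper: sliding windows that never cross an object boundary."""
    chunks = []
    for _, group in frame.groupby('Object', sort=False):
        values = group.to_numpy()
        if len(values) < size:
            continue  # this object has too few rows for a single window
        # Shape (n_windows, n_cols, size): the window axis is appended last.
        view = sliding_window_view(values, size, axis=0)
        # Keep every `shift`-th window and reorder to (n_windows, size, n_cols).
        chunks.append(view[::shift].transpose(0, 2, 1))
    return np.concatenate(chunks)

windows = object_windows(df)
print(windows.shape)
```

The resulting array can then be passed to tf.data.Dataset.from_tensor_slices, which yields one (3, 4) window per element.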
    

    Update 2: Yet another option is to window the whole dataset and then drop the windows that span more than one object. This way you can flexibly decide what to do with those mixed windows:

    import tensorflow as tf
    import pandas as pd
    import numpy as np
    
    
    df = pd.DataFrame({'Time': np.tile(np.arange(5), 2),
                       'Object': np.concatenate([[i] * 5 for i in [1, 2]]),
                       'Feature1': np.random.randint(10, size=10),
                       'Feature2': np.random.randint(10, size=10)})
    
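    dataset = (tf.data.Dataset.from_tensor_slices(df)
               .window(3, shift=1, drop_remainder=True)
               # Turn each window (a nested dataset) into one tensor of shape (3, 4).
               .flat_map(lambda x: x.batch(3))
               # tf.unique(...)[1] holds, for each row, the index of its object id among
               # the unique ids; all zeros means the window contains a single object.
               .filter(lambda y: tf.reduce_all(tf.unique(y[..., 1])[1] == 0)))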
    
    for w in dataset:
      print(w)
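
The filter above relies on tf.unique returning, as its second output, the position of each input value in the list of unique values. The first value always gets position 0, so a window whose positions are all 0 holds a single object id. The sketch below spells that out and, as one possible way to let the model see which object a window belongs to, splits each window into an id and a feature block. The function names, the dictionary keys and the choice to drop the Time column are assumptions for illustration:

```python
import numpy as np
import tensorflow as tf

# Fixed rows in the column order [Time, Object, Feature1, Feature2].
rows = np.array([[0, 1, 3, 3], [1, 1, 9, 2], [2, 1, 6, 6], [3, 1, 4, 0], [4, 1, 7, 7],
                 [0, 2, 4, 8], [1, 2, 3, 7], [2, 2, 1, 1], [3, 2, 7, 5], [4, 2, 1, 7]])

def single_object(window):
    # idx[i] is the position of window[i, 1] among the unique object ids.
    _, idx = tf.unique(window[:, 1])
    # All positions equal to 0 means every row has the same object id.
    return tf.reduce_all(tf.equal(idx, 0))

def split_object_id(window):
    # Hypothetical layout: one scalar id per window plus the two feature columns.
    return {'object_id': window[0, 1], 'features': window[:, 2:]}

dataset = (tf.data.Dataset.from_tensor_slices(rows)
           .window(3, shift=1, drop_remainder=True)
           .flat_map(lambda w: w.batch(3, drop_remainder=True))
           .filter(single_object)
           .map(split_object_id))

examples = list(dataset.as_numpy_iterator())
print(len(examples))
```

The object_id value could then feed an embedding layer while features goes to the RNN/LSTM, although how to combine the two is a modelling decision this sketch does not settle.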