python tensorflow lstm tf.data.dataset multivariate-time-series

How to remove NaN values from a tf.data.Dataset of multivariate input sequences for an LSTM


I am trying to feed a huge dataset (too large to fit in memory) to my LSTM model. I want to apply some transformations to my data using tf.data.Dataset. I first turn my NumPy data into a dataset using tf.keras.utils.timeseries_dataset_from_array. This is an example of my data:

[image: sample of the data]

The first 6 columns are features, the last one is my target, and the rows are timesteps.

I turn my 6 input features into sequences of 5 timesteps and want to predict a single output value, using this code:

input_dataset = tf.keras.utils.timeseries_dataset_from_array(
        data[:,:-1], None, sequence_length=5, sequence_stride=1, shuffle=True, seed=1)

target_dataset = tf.keras.utils.timeseries_dataset_from_array(
        data[:,-1], None, sequence_length=1, sequence_stride=1,
        shuffle=True, seed=1)

As you can see in my data, some values are missing. I am trying to remove every sequence (input with its associated output) that has a NaN in the input OR the output.

I tried to adapt an example and got this:

filter_nan = lambda i, j: not tf.reduce_any(tf.math.is_nan(i)) and not tf.math.is_nan(j)
ds = tf.data.Dataset.zip((input_dataset, target_dataset)).filter(filter_nan)

but I get this error:

Using a symbolic `tf.Tensor` as a Python `bool` is not allowed in Graph execution. Use Eager execution or decorate this function with @tf.function.

I took a look at @tf.function, but it is beyond my understanding for the moment, and I am not sure my initial attempt was correct anyway.


Solution

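  • The problem is that you used Python logical operators (not, and) instead of TensorFlow logical operators. Inside Dataset.filter the predicate runs in graph mode, where a symbolic tensor cannot be converted to a Python bool. There are 2 ways to remedy this. The most direct way is to replace the Python logical operators with their TensorFlow equivalents, tf.math.logical_not and tf.math.logical_and.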
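
    A minimal sketch of that direct fix, on made-up toy data. The name keep_pair and the toy arrays are hypothetical. It assumes the dataset yields individual (window, target) pairs rather than batches, because the predicate passed to filter must return a single scalar boolean; that is also why the target check is wrapped in tf.reduce_any.

```python
import numpy as np
import tensorflow as tf

def keep_pair(inputs, target):
    """Return a scalar tf.bool: True when neither tensor contains a NaN."""
    inputs_ok = tf.math.logical_not(tf.reduce_any(tf.math.is_nan(inputs)))
    target_ok = tf.math.logical_not(tf.reduce_any(tf.math.is_nan(target)))
    # TensorFlow op instead of the Python `and`, so it works in graph mode
    return tf.math.logical_and(inputs_ok, target_ok)

# Toy data (hypothetical): three unbatched pairs of a (5, 6) window and a (1,) target
inputs = np.ones((3, 5, 6), dtype=np.float32)
targets = np.ones((3, 1), dtype=np.float32)
inputs[1, 2, 0] = np.nan   # NaN inside the second input window
targets[2, 0] = np.nan     # NaN in the third target

pairs = tf.data.Dataset.from_tensor_slices((inputs, targets))
clean = pairs.filter(keep_pair)

n_kept = sum(1 for _ in clean)
print(n_kept)  # only the first pair is NaN-free
```

    With the datasets from the question, this would mean passing batch_size=None to timeseries_dataset_from_array, zipping, filtering with keep_pair, and batching afterwards.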

    My preferred way to fix this, though, is to filter the data first and split it into inputs and labels afterwards. You also don't need to rebuild the dataset with tf.data.Dataset.zip: datasets have a built-in method called map that applies a function to every element and returns the mapped dataset. Here is a code snippet that drops every window that has NaNs in it and then splits the windows into inputs and labels with the same shape as the ones in your code. I also batch after filtering instead of before, by setting batch_size=None and then calling the batch method on the filtered dataset. This way, the batch sizes aren't affected by the number of NaNs.

    import numpy as np
    import pandas as pd
    import tensorflow as tf

    def split_window(features):
        # features has shape (batch, timesteps, columns); the last column is the target.
        inputs = features[:, :, :-1]   # every timestep, feature columns only
        # Assumption: the label is the target value at the last timestep of the window.
        labels = features[:, -1, -1:]  # shape (batch, 1)

        return inputs, labels

    def make_dataset(data):
        data = np.array(data, dtype=np.float32)
        # batch_size=None yields individual windows, so they can be filtered one by one.
        ds = tf.keras.utils.timeseries_dataset_from_array(
            data=data,
            targets=None,
            sequence_length=5,
            sequence_stride=1,
            shuffle=True,
            batch_size=None)
        # Keep only the windows without any NaN, then batch.
        ds = ds.filter(lambda x: tf.reduce_all(tf.math.logical_not(tf.math.is_nan(x)))).batch(128)

        ds = ds.map(split_window)

        return ds

    data = pd.DataFrame(np.random.rand(2000, 7))
    data.iloc[::100, 2] = np.nan  # inject a few missing values for the demo
    ds = make_dataset(data)
    sample1 = next(iter(ds))
    print(sample1[0].shape, sample1[1].shape)


    Output:

    (128, 5, 6) (128, 1)
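
    If you want to sanity-check the filter, a NumPy-only count of the NaN-free windows can be compared with the number of elements that survive it. This is a rough sketch, not part of the pipeline: count_clean_windows is a hypothetical helper, the NaN pattern is made up, and it assumes the array (or the chunk you test) fits in memory.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def count_clean_windows(data, sequence_length=5):
    """Count windows of `sequence_length` consecutive rows that contain no NaN."""
    row_has_nan = np.isnan(data).any(axis=1)                     # shape (n_rows,)
    windows = sliding_window_view(row_has_nan, sequence_length)  # (n_rows - L + 1, L)
    # A window is clean when none of its rows has a NaN
    return int((~windows.any(axis=1)).sum())

rng = np.random.default_rng(0)
data = rng.random((2000, 7)).astype(np.float32)
data[::100, 2] = np.nan   # one missing value every 100 rows (made-up pattern)

n_clean = count_clean_windows(data)
print(n_clean)  # 1996 windows in total, 96 of them touch a NaN row -> 1900
```

    With sequence_stride=1, the filtered (unbatched) dataset should yield the same number of windows.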