python tensorflow lstm tf.data.dataset multivariate-time-series

How to remove NaN values from a tf.data.Dataset of multivariate input sequences for an LSTM

I am trying to feed a huge dataset (too large to fit in memory) to my LSTM model. I want to apply some transformations to my data using tf.data.Dataset. I first turn my NumPy data into a dataset using tf.keras.utils.timeseries_dataset_from_array. This is an example of my data:

The first 6 columns are features, the last one is my target, and the rows are timesteps.

I turn my 6 feature inputs into sequences of 5 timesteps and want to predict a single output value, using this code:

input_dataset = tf.keras.utils.timeseries_dataset_from_array(
        data[:,:-1], None, sequence_length=5, sequence_stride=1, shuffle=True, seed=1)

target_dataset = tf.keras.utils.timeseries_dataset_from_array(
        data[:,-1], None, sequence_length=1, sequence_stride=1,
        shuffle=True, seed=1)

As you can see in my data, some values are missing. What I am trying to do is remove every sequence (input with its associated output) that has a NaN in the input OR the output.

I tried to adapt an example and got this:

filter_nan = lambda i, j: not tf.reduce_any(tf.math.is_nan(i)) and not tf.math.is_nan(j)
ds = tf.data.Dataset.zip((input_dataset, target_dataset)).filter(filter_nan)

but I get this error:

Using a symbolic `tf.Tensor` as a Python `bool` is not allowed in Graph execution. Use Eager execution or decorate this function with @tf.function.

I took a look at @tf.function, but it is beyond my understanding for the moment, and I am not sure my initial attempt was correct anyway.

Solution

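The problem is that you used Python logical operators (not, and) instead of TensorFlow logical operators. The function passed to filter is traced in graph mode, so its arguments are symbolic tensors, and Python cannot convert a symbolic tensor to a bool. There are two ways to remedy this. The most direct one is to replace the Python logical operators with the corresponding TensorFlow operators: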

TensorFlow logical_and - tf.math.logical_and(x, y, name=None)
TensorFlow logical_not - tf.math.logical_not(x, name=None)
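
As a rough sketch of that first option, here is the filter rewritten with the TensorFlow operators. The toy arrays, the names input_ds, target_ds and no_nan, and the unbatched from_tensor_slices datasets are my own assumptions for illustration; they only stand in for your windows and are not your real pipeline.

```python
import numpy as np
import tensorflow as tf

# Toy stand-ins (assumed shapes): 4 input windows of shape (5, 6)
# and 4 targets of shape (1,). Two of them get a NaN on purpose.
inputs = np.zeros((4, 5, 6), dtype=np.float32)
targets = np.zeros((4, 1), dtype=np.float32)
inputs[1, 2, 3] = np.nan   # NaN inside input window 1
targets[3, 0] = np.nan     # NaN in target 3

input_ds = tf.data.Dataset.from_tensor_slices(inputs)
target_ds = tf.data.Dataset.from_tensor_slices(targets)

def no_nan(x, y):
    # Tensor-level replacements for Python's `not ... and not ...`.
    x_ok = tf.math.logical_not(tf.reduce_any(tf.math.is_nan(x)))
    y_ok = tf.math.logical_not(tf.reduce_any(tf.math.is_nan(y)))
    # The predicate must return a scalar boolean tensor.
    return tf.math.logical_and(x_ok, y_ok)

clean_ds = tf.data.Dataset.zip((input_ds, target_ds)).filter(no_nan)
n_kept = sum(1 for _ in clean_ds)
print(n_kept)  # 2: pairs 1 and 3 are dropped
```

Note that the predicate sees one dataset element at a time. If the datasets are already batched, as they are with the default batch_size of timeseries_dataset_from_array, a single NaN would discard the whole batch.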

My preferred way to fix this, though, is to filter the data first and split it into inputs and labels afterwards. You also don't need to repackage the data as a new dataset: datasets have a built-in method called map that applies a function to every element and returns the mapped dataset. Here is a code snippet that drops every window that contains a NaN and then splits the windows into inputs and labels with the same shapes as the ones in your code. I also batch after filtering instead of before, by setting batch_size=None and then calling the batch method on the filtered dataset. This way, the batch size isn't affected by the number of NaNs.

import numpy as np
import pandas as pd
import tensorflow as tf

def split_window(features):
    # features has shape (batch, timesteps, columns); the last column is the target.
    inputs = features[:, :, :-1]   # all timesteps, the 6 feature columns
    labels = features[:, -1, -1:]  # target at the last timestep of each window

    return inputs, labels

def make_dataset(data):
    data = np.array(data, dtype=np.float32)
    ds = tf.keras.utils.timeseries_dataset_from_array(
        data=data,
        targets=None,
        sequence_length=5,
        sequence_stride=1,
        shuffle=True,
        batch_size=None)  # yield single windows so they can be filtered one by one
    # Keep only the windows without any NaN, then batch.
    ds = ds.filter(lambda x: tf.reduce_all(tf.math.logical_not(tf.math.is_nan(x)))).batch(128)

    ds = ds.map(split_window)

    return ds

data = pd.DataFrame(np.random.rand(2000, 7))
ds = make_dataset(data)
sample1 = next(iter(ds))
print(sample1[0].shape, sample1[1].shape)

Output:

(128, 5, 6) (128, 1)
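
Since random data contains no NaN, a quick way to check that the filter really drops windows is to plant a missing value yourself. This is only a small sketch: the 20-row toy array and the names windows, clean, n_total and n_clean are assumptions of mine, and shuffling is turned off just to keep the check deterministic.

```python
import numpy as np
import tensorflow as tf

# Hypothetical toy data: 20 timesteps, 7 columns (6 features + 1 target).
data = np.arange(20 * 7, dtype=np.float32).reshape(20, 7)
data[10, 2] = np.nan  # one missing value at timestep 10

windows = tf.keras.utils.timeseries_dataset_from_array(
    data=data,
    targets=None,
    sequence_length=5,
    sequence_stride=1,
    shuffle=False,     # deterministic order for this check
    batch_size=None)   # individual windows of shape (5, 7)

# Keep a window only if none of its values is NaN.
clean = windows.filter(
    lambda w: tf.math.logical_not(tf.reduce_any(tf.math.is_nan(w))))

n_total = sum(1 for _ in windows)
n_clean = sum(1 for _ in clean)
print(n_total, n_clean)
```

With 20 rows and a window length of 5 there are 16 windows, and the 5 windows that overlap timestep 10 should be removed, leaving 11.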