Tags: python, pandas, tensorflow, keras, lazy-evaluation

TensorFlow 2.0: create a Dataset to feed a model with multiple inputs of different shapes under lazy evaluation


I have a Keras model with two inputs of different shapes. One side takes a few categorical features, while the other takes multiple time series of length PAST_HISTORY. The output is also multiple time series:

# Categorical data input
input_ct = keras.Input(shape=(len(categ_cols),),
                       name='categorical_input')

# Timeseries input
input_ts = keras.Input(shape=(PAST_HISTORY, len(series_cols)),
                       name='timeseries_input')

...

model = keras.models.Model(inputs=[input_ct, input_ts], outputs=outputs)

I created a Dataset for each input and for the output using a pandas DataFrame and some tf.data.Dataset operations.

df_ts = df[series_cols][:-FUTURE_TARGET]
ts_batch = lambda window: window.batch(PAST_HISTORY)
time_series_data = tf.data.Dataset.from_tensor_slices(df_ts)\
    .window(PAST_HISTORY, 1, 1, True)\
    .flat_map(ts_batch)

df_cat = df[categ_cols][PAST_HISTORY - 1:-FUTURE_TARGET]
date_data = tf.data.Dataset.from_tensor_slices(df_cat)

df_target = df[target_cols][PAST_HISTORY:]
target_batch = lambda window: window.batch(FUTURE_TARGET)
target_data = tf.data.Dataset.from_tensor_slices(df_target)\
    .window(FUTURE_TARGET, 1, 1, True)\
    .flat_map(target_batch)
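The three slices above are meant to line up sample by sample. The index arithmetic can be checked without TensorFlow; this is only a sketch in plain Python, where `sliding_windows` is a hypothetical helper that mimics `window(size, 1, 1, True)`, and `N_ROWS = 100` is an assumed row count chosen to keep the check small:

```python
def sliding_windows(rows, size):
    """Mimic Dataset.window(size, shift=1, stride=1, drop_remainder=True)."""
    return [rows[i:i + size] for i in range(len(rows) - size + 1)]

N_ROWS = 100          # assumed row count, smaller than the 10000 rows of the dummy data
PAST_HISTORY = 24
FUTURE_TARGET = 8

rows = list(range(N_ROWS))  # row indices stand in for the DataFrame rows

ts_windows = sliding_windows(rows[:-FUTURE_TARGET], PAST_HISTORY)
cat_rows = rows[PAST_HISTORY - 1:-FUTURE_TARGET]
target_windows = sliding_windows(rows[PAST_HISTORY:], FUTURE_TARGET)

# All three sequences should have N_ROWS - PAST_HISTORY - FUTURE_TARGET + 1 elements.
expected = N_ROWS - PAST_HISTORY - FUTURE_TARGET + 1
print(len(ts_windows), len(cat_rows), len(target_windows), expected)

# For every sample, the categorical row is the last row of the history window,
# and the target window starts on the row right after it.
for hist, cat, tgt in zip(ts_windows, cat_rows, target_windows):
    assert hist[-1] == cat
    assert tgt[0] == cat + 1
```

Since the three lengths agree, combining the datasets element by element drops nothing and keeps each history window paired with its own categorical row and target window.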

To create the final Dataset I used a generator:

def generator():
    for d1, d2, t in zip(date_data, time_series_data, target_data):
        yield {"categorical_input": d1, "timeseries_input": d2}, tf.transpose(t)

dataset = tf.data.Dataset.from_generator(generator,
    output_types=(
        {'categorical_input': tf.int64, 'timeseries_input': tf.float64},
        tf.float64),
    output_shapes=(
        {'categorical_input': (len(categ_cols),),'timeseries_input': (PAST_HISTORY, len(series_cols))},
        (len(target_cols), FUTURE_TARGET),))

This worked, and I managed to train a model in eager execution by calling model.fit. However, now that I'm trying to create an Estimator from this model, creating the Dataset no longer works: iterating over a Dataset in Python implicitly calls its __iter__ method, which is not allowed under lazy (graph) evaluation. Specifically, the problem lies in the zip call inside the generator.

I tried to create the same dataset without the generator with the following code:

dataset = tf.data.Dataset.from_tensors(
        ({'categorical_input': date_data, 'timeseries_input': time_series_data}, target_data)
)

This gives me the following error when I call estimator.train:

TypeError: Failed to convert object of type <class 'tensorflow.python.data.ops.dataset_ops._NestedVariant'> to Tensor.
Contents: <tensorflow.python.data.ops.dataset_ops._NestedVariant object at 0x7f5bf84a97f0>.
Consider casting elements to a supported type.

How can I solve this error? Or is there another way to construct this Dataset without having to iterate over a Dataset in Python?

Also, I tried to cast the Datasets and got the following error on the windowed Datasets:

TypeError: Failed to convert object of type <class 'tensorflow.python.data.ops.dataset_ops.FlatMapDataset'> to Tensor.
Contents: <FlatMapDataset shapes: (None, 2), types: tf.float64>.
Consider casting elements to a supported type.

Dummy data:

import numpy as np
import pandas as pd

df = pd.DataFrame(data={
        'ts_1': np.random.rand(10000),
        'ts_2': np.random.rand(10000),
        'ts_objective': np.random.rand(10000),
        'cat_1': np.random.randint(1, 10 + 1, 10000),
        'cat_2': np.random.randint(1, 25 + 1, 10000),
        'cat_3': np.random.randint(1, 30 + 1, 10000),
        'cat_4': np.random.randint(1, 50 + 1, 10000)})

categ_cols = ['cat_1', 'cat_2', 'cat_3', 'cat_4']
series_cols = ['ts_1', 'ts_2']
target_cols = ['ts_objective']

PAST_HISTORY = 24
FUTURE_TARGET = 8

Solution

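  • You can build the dataset you need without a generator, and much faster, using only Dataset operations. Dataset.zip pairs the three datasets element by element, and map packs each triple into the (features, target) structure the model expects: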

    import tensorflow as tf
    
    date_data = ...
    time_series_data = ...
    target_data = ...
    
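    # Pack one zipped element into the (features, target) structure.
    def data_tx(d1, d2, t):
        return {"categorical_input": d1, "timeseries_input": d2}, tf.transpose(t)

    # Dataset.zip combines the datasets as a Dataset op, so no Python iteration is needed.
    dataset = tf.data.Dataset.zip((date_data, time_series_data, target_data)).map(data_tx)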
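
For reference, here is one way the whole pipeline might look end to end. This is a minimal sketch, not the asker's exact setup: the NumPy arrays stand in for the DataFrame columns, `N_ROWS` and the function name `make_dataset` are hypothetical, and `drop_remainder=True` on the inner `batch` calls is an addition made here so that the element shapes are static rather than `(None, ...)`.

```python
import numpy as np
import tensorflow as tf

PAST_HISTORY = 24
FUTURE_TARGET = 8
N_ROWS = 100  # assumed small size, for illustration only

rng = np.random.default_rng(0)
series = rng.random((N_ROWS, 2))               # stands in for df[series_cols]
categ = rng.integers(1, 11, size=(N_ROWS, 4))  # stands in for df[categ_cols]
target = rng.random((N_ROWS, 1))               # stands in for df[target_cols]


def make_dataset():
    """Build the zipped dataset using tf.data operations only."""
    # Sliding windows of PAST_HISTORY consecutive rows.
    time_series_data = (
        tf.data.Dataset.from_tensor_slices(series[:-FUTURE_TARGET])
        .window(PAST_HISTORY, shift=1, stride=1, drop_remainder=True)
        .flat_map(lambda w: w.batch(PAST_HISTORY, drop_remainder=True))
    )
    # One categorical row per sample: the last row of each history window.
    date_data = tf.data.Dataset.from_tensor_slices(
        categ[PAST_HISTORY - 1:-FUTURE_TARGET])
    # Sliding windows of FUTURE_TARGET rows, starting right after the history.
    target_data = (
        tf.data.Dataset.from_tensor_slices(target[PAST_HISTORY:])
        .window(FUTURE_TARGET, shift=1, stride=1, drop_remainder=True)
        .flat_map(lambda w: w.batch(FUTURE_TARGET, drop_remainder=True))
    )

    def data_tx(d1, d2, t):
        return {"categorical_input": d1, "timeseries_input": d2}, tf.transpose(t)

    return tf.data.Dataset.zip(
        (date_data, time_series_data, target_data)).map(data_tx)


dataset = make_dataset()
print(dataset.element_spec)
```

Wrapping the construction in a function is a design choice rather than a requirement: it keeps every Dataset op inside one callable, which is the shape an Estimator `input_fn` is expected to have, and a `.batch(...)` call can be appended to the returned dataset before training.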