Initial Value of the Y Features in a Keras LSTM Model


I'm trying to train a ConvLSTM2D model in Keras using the Functional API, and I'm confused about one point. I've been reading all day, and I'm still not sure I really understand what I'm doing. The short version is: I don't think I need a stateful model, but I'm trying to figure out how to make the model take into account the initial values (at t0) of the target features (Y) and then predict the target values for the rest of the sequence (from t1 onward).

The task is something like predicting rain: the amount of rain at time t at position x, y is a product of a variety of features (wind speed, altitude, etc.) and, of course, the amount of precipitation during earlier periods, because if it's already raining, it's more likely to keep raining. Because the current state of the weather should help predict the weather in the future, the target values are shifted (lagged) by one time step, so X at t0 predicts Y at t1.

I inherited some code from another team, who were about as ignorant as I am, but much more confident in their abilities. A key problem in their model estimation (specifically, in the way they coded the data generator) was that they used the same vector of features for both the X and Y arrays (applying a shift to the Y array), with the unfortunate consequence that they were predicting all the features, not just the one of interest. The model definition and training code looks something like this:

from tensorflow import keras
from tensorflow.keras import layers

# sequence_length, x_grid_length, y_grid_length, and num_features are set
# elsewhere to match the data generator's output.
inputs = layers.Input(shape=(sequence_length, x_grid_length, y_grid_length, num_features))
outputs = layers.ConvLSTM2D(filters=32,
                            kernel_size=(5, 5),
                            padding="same",
                            return_sequences=True,
                            stateful=False,
                            activation="relu")(inputs)
outputs = layers.ConvLSTM2D(filters=32,
                            kernel_size=(3, 3),
                            padding="same",
                            return_sequences=True,
                            stateful=False,
                            activation="relu")(outputs)
# collapse the feature maps to a single channel (the predicted rain amount)
outputs = layers.Conv3D(filters=1,
                        kernel_size=(3, 3, 3),
                        padding="same",
                        activation="sigmoid")(outputs)
model = keras.models.Model(inputs=inputs, outputs=outputs)
loss = keras.losses.mean_squared_error
opt = keras.optimizers.Adam(learning_rate=0.001)
model.compile(loss=loss, optimizer=opt)

history = model.fit(training_data,
                    epochs=8,
                    verbose=2,
                    validation_data=val_data)

Here, the variables training_data and val_data are instances of a custom subclass of Keras Sequence. I modified the sequence class to return only the target values as Y (i.e., the amount of rain), with return_sequences set to True, while still returning the entire feature set, including the target column, as X.
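
In rough outline, the modified generator does something like this (the class and variable names are illustrative, not my real code, and I'm assuming the target is stored as channel 0 of the feature array):

import numpy as np
from tensorflow import keras

class RainSequence(keras.utils.Sequence):
    def __init__(self, runs, sequence_length, batch_size):
        self.runs = runs                  # list of (time, x, y, features) arrays
        self.sequence_length = sequence_length
        self.batch_size = batch_size
        # every overlapping window; each needs sequence_length + 1 frames so
        # the targets can be lagged one step
        self.index = [(r, s) for r, run in enumerate(runs)
                      for s in range(run.shape[0] - sequence_length)]

    def __len__(self):
        return int(np.ceil(len(self.index) / self.batch_size))

    def __getitem__(self, i):
        batch = self.index[i * self.batch_size:(i + 1) * self.batch_size]
        xs, ys = [], []
        for r, s in batch:
            window = self.runs[r][s:s + self.sequence_length + 1]
            xs.append(window[:-1])            # all features, t0..t(n-1)
            ys.append(window[1:, ..., :1])    # target channel only, t1..tn
        return np.stack(xs), np.stack(ys)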

After the first attempt at training, I realized I had a problem: when predicting on the test set, I'm feeding the true precipitation values into the model at every step except, of course, the final one--which kind of defeats the purpose of an RNN. That's obviously not right, so I then modified the sequence class to supply only the non-target features as X. But that's not right either, because if it's already raining at the start of the sequence, that information never gets into the model.

After a lot of reading, I don't think I want a stateful model, because I only really care about the target value at t0 (and I'm training on overlapping sequences, starting at t0, t1, t2, etc., which wouldn't fit correctly into a stateful model, where I'd want to carry the last state of one sequence into the next). I realize that even with a stateless model, I can pass an initial_state when calling an RNN layer--but I want to specify the initial values of the target features, not the initial state of a hidden layer. How can I do that?


Solution

  • After further reading and a good night's sleep, I think I have an answer, though I won't mark it as accepted yet, in case someone has a better one. As I wrote above, the real problem here is with prediction. Indeed, the problem doesn't even affect model training, but I do need to do some extra work to use the model for predicting multiple time steps into the future.

    For model training, there's nothing at all wrong with feeding the target values into the model as both a feature in the X array and also as the actual targets in the Y array (lagged/shifted by one time period, of course). That's just autoregression (or the deep learning equivalent of it), which statisticians have been doing for decades with time series data.
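
    As a toy illustration of the lag (assuming, as above, that rain is channel 0 of the feature array):

    import numpy as np

    run = np.random.rand(49, 16, 16, 5)  # (time, x, y, features); channel 0 = rain
    seq_len = 8
    X = run[0:seq_len]                   # all features at t0..t7, rain included
    Y = run[1:seq_len + 1, ..., :1]      # rain only, at t1..t8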

    However, prediction is a different story, because what I really need to do is to predict multiple steps into the future. The current model definition and Sequence sub-class will work fine for predicting one time step into the future--that is, after all, exactly what happens when your target value is lagged by one time step. Going multiple steps is trickier.

    I do have the data to predict multiple time steps: I have runs of 49 steps each, which, for training purposes, let me create a number of sequences per run equal to 49 minus the sequence length (the extra step is needed to provide the lagged target values). The trick, then, is to take the predicted target values for the first step, feed them back into the X array in place of the original target data, and repeat for each subsequent step. That will require either modifying my Sequence sub-class to include a prediction mode that takes the previous step's predictions as input, creating an entirely new sub-class, or just feeding the prediction sequences in one by one (slower, but that only matters if you need lots of predictions). A sketch of the feedback loop is below.
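
    Roughly like this (a minimal sketch, again assuming rain is channel 0 of the feature array and the model returns the full sequence; run must contain at least seq_len + n_steps frames):

    import numpy as np

    def predict_multistep(model, run, seq_len, n_steps):
        """run: a (time, x, y, features) array; returns n_steps of predicted rain."""
        window = run[:seq_len].copy()              # seed the window with observed data
        preds = []
        for step in range(n_steps):
            y = model.predict(window[np.newaxis], verbose=0)  # (1, seq_len, x, y, 1)
            next_rain = y[0, -1]                   # forecast for the step after the window
            preds.append(next_rain)
            # slide the window forward: keep the real non-target features for the
            # next frame, but overwrite its rain channel with the prediction
            next_frame = run[seq_len + step].copy()
            next_frame[..., :1] = next_rain
            window = np.concatenate([window[1:], next_frame[np.newaxis]], axis=0)
        return np.stack(preds)                     # (n_steps, x, y, 1)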

    It's probably also a good idea to generate loss statistics for these multi-step predictions, since the loss Keras calculates during training is only for a one-step prediction; I'll have to add code to do this, comparing the predicted target values to the target values in the original data.
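
    For example, using the predict_multistep sketch above, the per-horizon MSE over one run would look something like:

    preds = predict_multistep(model, run, seq_len, n_steps=10)  # (10, x, y, 1)
    truth = run[seq_len:seq_len + 10, ..., :1]                  # observed rain
    mse_per_step = ((preds - truth) ** 2).mean(axis=(1, 2, 3))
    print(mse_per_step)  # expect the error to grow with the horizon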

    EDIT: It has also occurred to me that I can't see a reason to set return_sequences to True in the last ConvLSTM2D layer--the original team, for some reason, was trying to predict the entire sequence, but I really only need the final output of each sequence.
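
    Note that with return_sequences=False the time axis disappears from the output, so the Conv3D head has to become a Conv2D, and the generator's Y shrinks to a single frame per sequence. Something like this (my sketch, not yet tested):

    from tensorflow import keras
    from tensorflow.keras import layers

    inputs = layers.Input(shape=(sequence_length, x_grid_length, y_grid_length, num_features))
    x = layers.ConvLSTM2D(32, (5, 5), padding="same", return_sequences=True,
                          activation="relu")(inputs)
    x = layers.ConvLSTM2D(32, (3, 3), padding="same",
                          return_sequences=False,  # keep only the final step
                          activation="relu")(x)
    outputs = layers.Conv2D(1, (3, 3), padding="same", activation="sigmoid")(x)
    model = keras.models.Model(inputs=inputs, outputs=outputs)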