Tags: neural-network, keras, lstm, recurrent-neural-network, training-data

Interesting results from LSTM RNN: lagged results for train and validation data


As an introduction to RNNs/LSTMs (stateless), I'm training a model on sequences of 200 days of previous data (X), including features like daily price change, daily volume change, etc., and for the label (Y) I use the % price change from the current price to the price 4 months later. Basically I want to estimate the market direction, not be 100% accurate. But I'm getting some odd results...
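Roughly, the setup looks like the sketch below (the feature names, the ~84-trading-day horizon and the hyperparameters are illustrative placeholders, not my exact values):

```python
# Minimal sketch of the setup described above; all names and numbers are assumptions.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

WINDOW = 200   # 200 days of history per sample
HORIZON = 84   # roughly 4 months of trading days ahead

def make_sequences(features: np.ndarray, close: np.ndarray):
    """features: (n_days, n_features) daily % changes, volume changes, etc.
    close: (n_days,) closing prices.
    Returns X of shape (n_samples, WINDOW, n_features) and
    y = % price change from the last day of each window to HORIZON days later."""
    X, y = [], []
    for t in range(WINDOW - 1, len(close) - HORIZON):
        X.append(features[t - WINDOW + 1:t + 1])       # window ending on day t
        y.append(close[t + HORIZON] / close[t] - 1.0)  # % change over the horizon
    return np.array(X), np.array(y)

n_features = 4  # assumed number of daily features

# Stateless LSTM regressor: one % change prediction per 200-day window.
model = keras.Sequential([
    keras.Input(shape=(WINDOW, n_features)),
    layers.LSTM(64),
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
# model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=50, batch_size=32)
```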

When I then test the model on the training data, I notice the model's output is a perfect fit to the actual data, except that it lags by exactly 4 months:

[Plot: model output vs. actual price on the training data]

When I shift the data by 4 months, you can see it's a perfect fit.

[Plot: the training-data output shifted by 4 months, overlapping the actual price]

I can obviously understand why the training data would be a very close fit, since the model has seen it all during training - but why the 4-month lag?

It does the same thing with the validation data (note the area I highlighted with the red box for future reference):

[Plot: model output vs. actual price on the validation data, with one area highlighted in a red box]

Time-shifted:

[Plot: the validation output shifted by 4 months]

It's not as close a fit as the training data, as you'd expect, but it's still too close for my liking - I just don't think it can be this accurate (see the little blip in the red rectangle as an example). I think the model is acting as a naive predictor; I just can't work out how or why it could be doing that.

To generate this output from the validation data, I input a sequence of 200 timesteps, but there's nothing in that sequence that says what the % price change will be in 4 months - it's entirely disconnected, so how can it be so accurate? The 4-month lag is obviously another sign that something's not right here; I can't explain it, but I suspect the two are linked.
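One way to sanity-check the naive-predictor suspicion would be to compare the model against a baseline that always predicts a 0% change (i.e. "price in 4 months = price today"); a rough sketch, reusing the placeholder names from the setup above:

```python
# Compare validation error of the model vs. an "always 0% change" baseline.
# `model`, `X_val`, `y_val` follow the earlier sketch and are assumptions.
import numpy as np

y_pred = model.predict(X_val).ravel()          # predicted % change over the horizon
mae_model = np.mean(np.abs(y_val - y_pred))    # model error on the % change
mae_naive = np.mean(np.abs(y_val - 0.0))       # error if we always predict "no change"

print(f"model MAE: {mae_model:.4f}   naive (0% change) MAE: {mae_naive:.4f}")
# If the two are close, the model adds little beyond the naive forecast,
# even though a price-level plot can look deceptively accurate.
```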


Solution

  • Okay, I realised my error: the way I was using the model to generate the forecast line was naive. For every date in the graphs above, I was getting an output from the model and then applying the forecasted % change to the actual price on that date - that gave the predicted price 4 months later.

    Given that markets usually only move within a margin of 0-3% (plus or minus) over a 4-month period, that meant my forecast was always going to closely mirror the current price, just with a 4-month lag.

    So at every date the prediction was being re-based on the actual price, which meant the forecast line could never deviate far from the actual one; it would always track it to within that 0-3% margin (see the sketch after this answer for the flawed re-basing next to a more direct comparison).

    Really, the graph isn't important, and it doesn't reflect how I'll use the output anyway, so I'm going to ditch the visual representation and concentrate on finding different metrics that lower the validation loss.
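In code, the mistake looks roughly like this (the names `model`, `make_sequences`, `features`, `close`, `WINDOW` and `HORIZON` follow the sketches above and are illustrative, not my actual code):

```python
# Sketch of the flawed plotting approach next to a comparison that avoids it.
import numpy as np

X_all, actual_change = make_sequences(features, close)   # realised % change per window
y_pred = model.predict(X_all).ravel()                    # predicted % change per window
anchor_prices = close[WINDOW - 1:len(close) - HORIZON]   # actual price on each prediction date

# (a) Flawed line: re-base every prediction on that day's actual price.
# Because the predicted % change is small (plus or minus a few percent), this
# line is essentially the actual price shifted forward by HORIZON days, hence
# the apparent "perfect fit with a 4-month lag".
forecast_price = anchor_prices * (1.0 + y_pred)

# (b) Better comparison: score the predicted % change against the realised
# % change directly, with no price re-basing to hide the error.
print(f"MAE on % change: {np.mean(np.abs(y_pred - actual_change)):.4f}")
```

Scoring the predicted % change against the realised % change, rather than re-based prices, makes the real forecast error visible straight away.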