Search code examples
pythonmachine-learningtensorflowlstmtflearn

Delay gap between reality and prediction


Using machine learning (as library I've tried Tensorflow and Tflearn (which, I know is just a wrapping of Tensorflow)) I'm trying to predict the congestion in an area for the next week (see my previous questions if you want more backstory on it). My training set is composed of 400K tagged entry (with the date an congestion value for each minute).

My problem is that I now have a time gap between predictions and reality.

If I had to draw a chart with the reality and prediction you would see that my prediction, while have the same shape as the reality is in advance. She increase/decrease before the reality. It started to make me think that maybe my training had a problem. It would seem like that my prediction didn't start when my training ended.

Both of my data-sets (training/testing) are on 2 different file. First I train on my training set (for convenience sake let's say it end at 100th minutes and my testing set start at 101th minute), once my model saved I do my predictions, it should normally then start to predict 101 or am I wrong somewhere? Because it seem like it's starting to predict way way after my training stopped (if I keep my example it would start predicting value 107 for example).

For now one of a bad fix was to remove from the training set as many value as I had of delay (take this example, it would be 7) and it worked, no more delay but I don't understand why I have this problem or how to fix it so it wouldn't happen later.

Following some advices found on different website it seem like having gap in my training dataset (missing timestamp in this case) could be a problem, seeing that there do was some (in total around 7 to 9% of the whole dataset was missing) I've used Pandas to add the missing timestamps (I've also gave them the congestion value of the last know timestamp) while I do think that it may have helped a little (the gap is smaller) it haven't fixed the problem.

I tried multistep forecasting, multivariate forecasting, LSTM, GRU, MLP, Tensorflow, Tflearn but it change nothing making me think it could come from my training. Here is my model training.

def fit_lstm(train, batch_size, nb_epoch, neurons):
    X, y = train[:, 0:-1], train[:, -1]
    X = X.reshape(X.shape[0], 1, X.shape[1])
    print X.shape
    print y.shape
    
    model = Sequential()
    model.add(LSTM(neurons, batch_input_shape=(None, X.shape[1], X.shape[2]), stateful=False))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='adam')
    for i in range(nb_epoch):
        model.fit(X, y, epochs=1, batch_size=batch_size, verbose=0, shuffle=False)
        model.reset_states()
    return model

The 2 shape are :

(80485, 1, 1)

(80485,)

(On this example I'm using only 80K of data as training for speed purpose).

As parameter I'm using 1 neuron, 64 of batch_size and 5 epoch. My dataset is made of 2 file. First is the training file with 2 column:

timestamp | values

The second have the same shape but is the testing set (separated to avoid any influence of it on my prediction), the file is only used once every prediction have been made and to compare reality and prediction. The testing set start where the training set stop.

Do you have an idea of what could be the reason of this problem?

Edit: On my code I have this function:

# invert differencing
    yhat = inverse_difference(raw_values, yhat, len(test_scaled)+1-i)
# invert differenced value
def inverse_difference(history, yhat, interval=1):
    return yhat + history[-interval]

It's supposed to invert the difference (to go from a scaled value to the real one). When using it like in the pasted example (using the testing set) I get perfection, accuracy above 95% and no gap. Using the testing set to inverse the difference

Since in reality we wouldn't know theses values I had to change it. I tried first to use the training set but got the problem explained on this post: Using the training set

Why is this happening? Is there an explanation for this problem?


Solution

  • Found it. It was a problem with the "def inverse_difference(history, yhat, interval=1):" function. In fact it make my result look like my last lines of training. This is why I had a gap, since there is a pattern in my data (peak always at more or less the same moment) I thought he was doing prediction while he was just giving me back values from training.