machine-learning encoding lstm recurrent-neural-network training-data

Encoding Time Series Forecasting with LSTM Networks


I have a large dataset whose entries have the form:

user_id, measurement_date, value1, value2,..

The challenge is how to handle gaps in the data. The measurements were taken at irregular intervals, so there will always be small as well as very large gaps.

What is the best way to handle missing data here?

I am thinking of the following approaches:

  • for every non-existent measurement, use a special vector. (This leads to impractical training data, since the non-measurement entries dominate the sequence.)
  • like the above, but group multiple consecutive non-measurements into one vector, e.g. by introducing a vector that represents the count of days on which no measurement was taken.
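The second approach can be sketched as follows. This is only a minimal illustration for a single user, assuming measurements are daily at most; the function name `with_gap_counts` and the `("gap", n)` / `("measurement", values)` tuples are hypothetical and not part of the original data format.

```python
from datetime import date

def with_gap_counts(measurements):
    """Interleave measurements with one gap entry per run of missing days.

    measurements: list of (measurement_date, values) tuples for ONE user.
    Returns a list of ("measurement", values) and ("gap", missing_days) steps.
    """
    ordered = sorted(measurements, key=lambda m: m[0])
    steps = []
    previous_day = None
    for day, values in ordered:
        if previous_day is not None:
            # Days strictly between two measurements had no measurement.
            missing = (day - previous_day).days - 1
            if missing > 0:
                steps.append(("gap", missing))
        steps.append(("measurement", values))
        previous_day = day
    return steps

# Made-up example values, for illustration only.
example = [
    (date(2020, 1, 1), (0.5, 1.2)),
    (date(2020, 1, 2), (0.6, 1.1)),
    (date(2020, 1, 9), (0.7, 0.9)),
]
steps = with_gap_counts(example)
```

Here the six missing days between 2 January and 9 January collapse into a single gap step, so long gaps no longer flood the sequence with placeholder vectors.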

My question now is: what is the best way to encode this?

At the moment the LSTM network gets its input in the form of unencoded input vectors:

vector1, vector2,..

The vectors contain the values.

But now, when I introduce new symbols like:

  s1 := <=3 days no measurement taken
  s2 := <=7 ..

I would one-hot encode them.
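Mapping a gap length to one of these symbols could look roughly like this. The thresholds 3 and 7 come from the definitions above; the catch-all symbol `s_long` for gaps longer than 7 days and the function name `gap_symbol` are assumptions added only so the sketch covers every input.

```python
import bisect

# Upper bounds in days, taken from the definitions of s1 and s2.
THRESHOLDS = [3, 7]
# "s_long" is a hypothetical catch-all for gaps of more than 7 days.
SYMBOLS = ["s1", "s2", "s_long"]

def gap_symbol(missing_days):
    """Return the symbol for a gap, or None when there is no gap."""
    if missing_days <= 0:
        return None
    # bisect_left finds the first threshold that is >= missing_days.
    return SYMBOLS[bisect.bisect_left(THRESHOLDS, missing_days)]
```

A gap of 1 to 3 days maps to `s1`, 4 to 7 days to `s2`, and anything longer to the assumed `s_long`.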

Is it best to introduce a prefix that distinguishes between the two word types?

E.g.

 1 vector -> 1, value1, value2
 0 vector -> 0, 0, 1 (s1)
          -> 0, 1, 0 (s2)
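The prefix scheme from this example can be written down as a small sketch in plain Python. It assumes two measured values per entry and exactly the two symbols s1 and s2; the names `GAP_CODES`, `encode_step` and `encode_sequence` are hypothetical, and the numbers in the usage line are made up.

```python
# One-hot codes for the gap symbols, in the same order as the example above.
GAP_CODES = {"s1": [0.0, 1.0], "s2": [1.0, 0.0]}

def encode_step(step, width=2):
    """Encode one time step as [flag, payload...].

    step is either a gap symbol name ("s1", "s2") or a sequence of
    measured values. The flag is 1 for measurements and 0 for gaps.
    """
    if isinstance(step, str):
        flag, payload = 0.0, list(GAP_CODES[step])
    else:
        flag, payload = 1.0, [float(v) for v in step]
    if len(payload) > width:
        raise ValueError("payload is wider than the chosen vector width")
    # Zero-pad so every time step has the same length.
    padding = [0.0] * (width - len(payload))
    return [flag] + payload + padding

def encode_sequence(steps, width=2):
    """Encode a whole sequence of measurements and gap symbols."""
    return [encode_step(step, width) for step in steps]

sequence = encode_sequence([(0.5, 1.2), "s1", (0.7, 0.9), "s2"])
```

Every row has the same length, so the result can be turned into an array and fed to the network as a single sequence. The leading flag is what keeps a measurement whose values happen to be 0 and 1 from being confused with a gap symbol.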

Solution

  • Actually, it is not possible to encode it either way.