python lstm prediction recurrent-neural-network anomaly-detection

Recurrent Neural Network for anomaly detection

I am implementing an anomaly detection system that will be used on different time series (one observation every 15 min for a total of 5 months). All these time series have a common pattern: high levels during working hours and low levels otherwise.

The idea presented in many papers is the following: build a model to predict future values and calculate an anomaly score based on the residuals.

What I have so far

I use an LSTM to predict the next time step given the previous 96 (1 day of observations) and then I calculate the anomaly score as the likelihood that the residuals come from one of the two normal distributions fitted on the residuals obtained with the validation test. I am using two different distributions, one for working hours and one for non working hours.

The model detects very well point anomalies, such as sudden falls and peaks, but it fails during holidays, for example.

If an holiday is during the week, I expect my model to detect more anomalies, because it's an unusual daily pattern wrt a normal working day. But the predictions simply follows the previous observations.

My solution

Use a second and more lightweight model (based on time series decomposition) which is fed with daily aggregations instead of 15min aggregations to detect daily anomalies.

The question

This combination of two models allows me to have both anomalies and it works very well, but my idea was to use only one model because I expected the LSTM to be able to "learn" also the weekly pattern. Instead it strictly follows the previous time steps without taking into consideration that it is a working hour and the level should be much higher. I tried to add exogenous variables to the input (hour of day, day of week), to add layers and number of cells, but the situation is not that better.

Any consideration is appreciated. Thank you

Solution

A note on your current approach

Training with MSE is equivalent to optimizing the likelihood of your data under a Gaussian with fixed variance and mean given by your model. So you are already training an autoencoder, though you do not formulate it so.

About the things you do

You don't give the LSTM a chance

Since you provide data from last 24 hours only, the LSTM cannot possibly learn a weekly pattern. It could at best learn that the value should be similar as it was 24 hours before (though it is very unlikely, see next point) -- and then you break it with Fri-Sat and Sun-Mon data. From the LSTM's point of view, your holiday 'anomaly' looks pretty much the same as the weekend data you were providing during the training.

So you would first need to provide longer contexts during learning (I assume that you carry the hidden state on during test time).
Even if you gave it a chance, it wouldn't care

Assuming that your data really follows a simple pattern -- high value during and only during working hours, plus some variations of smaller scale -- the LSTM doesn't need any long-term knowledge for most of the datapoints. Putting in all my human imagination, I can only envision the LSTM benefiting from long-term dependencies at the beginning of the working hours, so just for one or two samples out of the 96.

So even if the loss value at the points would like to backpropagate through > 7 * 96 timesteps to learn about your weekly pattern, there are 7*95 other loss terms that are likely to prevent the LSTM from deviating from the current local optimum.

Thus it may help to weight the samples at the beginning of working hours more, so that the respective loss can actually influence representations from far history.
Your solutions is a good thing

It is difficult to model sequences at multiple scales in a single model. Even you, as a human, need to "zoom out" to judge longer trends -- that's why all the Wall Street people have Month/Week/Day/Hour/... charts to watch their shares' prices on. Such multiscale modeling is especially difficult for an RNN, because it needs to process all the information, always, with the same weights.

If you really want on model to learn it all, you may have more success with deep feedforward architectures employing some sort of time-convolution, eg. TDNNs, Residual Memory Networks (Disclaimer: I'm one of the authors.), or the recent one-architecture-to-rule-them-all, WaveNet. As these have skip connections over longer temporal context and apply different transformations at different levels, they have better chances of discovering and exploiting such an unexpected long-term dependency.

There are implementations of WaveNet in Keras laying around on GitHub, e.g. 1 or 2. I did not play with them (I've actually moved away from Keras some time ago), but esp. the second one seems really easy, with the AtrousConvolution1D.

If you want to stay with RNNs, Clockwork RNN is probably the model to fit your needs.

About things you may want to consider for your problem

So are there two data distributions?

This one is a bit philosophical. Your current approach shows that you have a very strong belief that there are two different setups: workhours and the rest. You're even OK with changing part of your model (the Gaussian) according to it.

So perhaps your data actually comes from two distributions and you should therefore train two models and switch between them as appropriate?

Given what you have told us, I would actually go for this one (to have a theoretically sound system). You cannot expect your LSTM to learn that there will be low values on Dec 25. Or that there is a deadline and this weekend consists purely of working hours.
Or are there two definitions of anomaly?

One philosophical point more. Perhaps you personally consider two different types of anomaly:

A weird temporal trajectory, unexpected peaks, oscillations, whatever is unusual in your domain. Your LSTM supposedly handles these already.

And then, there is different notion of anomaly: Value of certain bound in certain time intervals. Perhaps a simple linear regression / small MLP from time to value would do here?
Let the NN do all the work

Currently, you effectively model the distribution of your quantity in two steps: First, the LSTM provides the mean. Second, you supply the variance.

You might instead let your NN (together with additional 2 affine transformations) directly provide you with a complete Gaussian by producing its mean and variance; much like in Variational AutoEncoders (https://arxiv.org/pdf/1312.6114.pdf, appendix C.2). Then, you need to optimize directly the likelihood of your following sample under the NN-distribution, rather than just MSE between the sample and the NN output.

This will allow your model to tell you when it is very strict about the following value and when "any" sample will be OK.

Note, that you can take this approach further and have your NN produce "any" suitable distribution. E.g. if your data live in-/can be sensibly transformed to- a limited domain, you may try to produce a Categorical distribution over the space by having a Softmax on the output, much like WaveNet does (https://arxiv.org/pdf/1609.03499.pdf, Section 2.2).