Tags: python, tensorflow, lstm, autoencoder, anomaly-detection

LSTM autoencoder for anomaly detection


I'm testing out different implementations of an LSTM autoencoder for anomaly detection on 2D inputs. My question is not about the code itself but about understanding the underlying behavior of each network.

Both implementations have the same number of units (16). Model 2 is a "typical" seq-to-seq autoencoder where the last output of the encoder is repeated "n" times to match the input length of the decoder. I'd like to understand why Model 1 seems to easily outperform Model 2, and why Model 2 isn't able to do better than the mean.

Model 1:

from tensorflow.keras import Model, layers

class LSTM_Detector(Model):
  def __init__(self, flight_len, param_len, hidden_state=16):
    super(LSTM_Detector, self).__init__()
    self.input_dim = (flight_len, param_len)
    self.units = hidden_state
    self.encoder = layers.LSTM(self.units,
                  return_state=True,
                  return_sequences=True,
                  activation="tanh",
                  name='encoder',
                  input_shape=self.input_dim)
    
    self.decoder = layers.LSTM(self.units,
                  return_sequences=True,
                  activation="tanh",
                  name="decoder",
                  input_shape=(self.input_dim[0],self.units))
    
    self.dense = layers.TimeDistributed(layers.Dense(self.input_dim[1]))
    
  def call(self, x):
    output, hs, cs = self.encoder(x)  # output: (batch, flight_len, 16); hs, cs: (batch, 16)
    encoded_state = [hs, cs] # see https://www.tensorflow.org/guide/keras/rnn
    decoded = self.decoder(output, initial_state=encoded_state)  # decoder reads the full encoded sequence
    output_decoder = self.dense(decoded)

    return output_decoder

Model 2:

class Seq2Seq_Detector(Model):
  def __init__(self, flight_len, param_len, hidden_state=16):
    super(Seq2Seq_Detector, self).__init__()
    self.input_dim = (flight_len, param_len)
    self.units = hidden_state
    self.encoder = layers.LSTM(self.units,
                  return_state=True,
                  return_sequences=False,
                  activation="tanh",
                  name='encoder',
                  input_shape=self.input_dim)
    
    self.repeat = layers.RepeatVector(self.input_dim[0])
    
    self.decoder = layers.LSTM(self.units,
                  return_sequences=True,
                  activation="tanh",
                  name="decoder",
                  input_shape=(self.input_dim[0],self.units))
    
    self.dense = layers.TimeDistributed(layers.Dense(self.input_dim[1]))
    
  def call(self, x):
    output, hs, cs = self.encoder(x)  # output: (batch, 16) only, since return_sequences=False
    encoded_state = [hs, cs] # see https://www.tensorflow.org/guide/keras/rnn
    repeated_vec = self.repeat(output)  # (batch, flight_len, 16): the same vector repeated for every step
    decoded = self.decoder(repeated_vec, initial_state=encoded_state)
    output_decoder = self.dense(decoded)

    return output_decoder

I fitted these 2 models for 200 epochs on a data sample of shape (89, 1500, 77), each input being a 2D array of shape (1500, 77), with test data of shape (10, 1500, 77). Both models had only 16 units.
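
For reference, the training setup was the usual reconstruction objective (the target is the input itself). Here is a minimal sketch of what it looks like; the optimizer, batch size, and the random placeholder arrays are illustrative assumptions, not the exact setup:

import numpy as np
from tensorflow.keras import optimizers

# Placeholder arrays standing in for the real data, with the shapes described above.
x_train = np.random.rand(89, 1500, 77).astype("float32")
x_test = np.random.rand(10, 1500, 77).astype("float32")

model = LSTM_Detector(flight_len=1500, param_len=77, hidden_state=16)  # or Seq2Seq_Detector(...)
model.compile(optimizer=optimizers.Adam(), loss="mse")
model.fit(x_train, x_train,                 # autoencoder: reconstruct the input
          epochs=200,
          batch_size=8,                     # assumed value
          validation_data=(x_test, x_test))

reconstruction = model.predict(x_test)      # compared feature by feature against the truth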

Here are the results of the autoencoder on one feature of the test data.

Results Model 1: (the black line is the truth, red is the reconstruction)

[Image: Model 1 reconstruction vs. ground truth on one test feature]

Results Model 2:

[Image: Model 2 reconstruction vs. ground truth on the same test feature]

I understand the second one is more restrictive, since all the information from the input sequence is compressed into one step, but I'm still surprised that it's barely able to do better than predicting the average.

On the other hand, I feel Model 1 tends to be more "influenced" by new data without actually reproducing the input; see the example below of Model 1 given a flat line as input:

[Image: Model 1 reconstruction when given a flat line as input]

PS: I know it's not a lot of data for that kind of model; I have much more available, but at this stage I'm just experimenting and trying to build my understanding.

PS 2: Neither model overfitted its data, and the training and validation curves are almost textbook-like.

Why is there such a gap in terms of behavior?


Solution

  • In Model 1, each point of 77 features is compressed and decompressed this way: 77 -> 16 -> 16 -> 77, plus some information from the previous steps. It seems that replacing the LSTMs with just TimeDistributed(Dense(...)) may also work in this case (see the sketch below), but I cannot say for sure as I don't know the data. The third image may become better.
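
    A minimal sketch of that baseline, just to illustrate the idea (Dense_Detector is a hypothetical name and the layer sizes are assumptions): each time step is compressed and decompressed independently, 77 -> 16 -> 77, with no recurrence at all.

        class Dense_Detector(Model):
          def __init__(self, flight_len, param_len, hidden_state=16):
            super(Dense_Detector, self).__init__()
            # Compress and decompress every step independently: 77 -> 16 -> 77.
            self.encoder = layers.TimeDistributed(layers.Dense(hidden_state, activation="tanh"))
            self.decoder = layers.TimeDistributed(layers.Dense(param_len))

          def call(self, x):
            return self.decoder(self.encoder(x))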

    What Model 2 predicts is what usually happens when there is no useful signal in the input and the best thing the model can do (well, is optimized to do) is just to predict the mean target value of the training set.

    In model 2 you have:

    ...
        self.encoder = layers.LSTM(self.units,
                      return_state=True,
                      return_sequences=False,
    ...
    

    and then

        self.repeat = layers.RepeatVector(self.input_dim[0])
    

    So, in fact, when it does

        repeated_vec = self.repeat(output)
        decoded = self.decoder(repeated_vec, initial_state=encoded_state)
    

    it just takes the last output of the encoder (which in this case represents the last of the 1500 steps), copies it 1500 times (input_dim[0]), and tries to predict all 1500 values from information about only the last couple of steps. This is where the model loses most of the useful signal: it does not have enough (or any) information about the rest of the input, and the best thing it can learn in order to minimize the loss function (which I suppose in this case is MSE or MAE) is to predict the mean value for each of the features.
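
    As a quick sanity check of that last claim: among all constant predictions, the one that minimizes MSE is the mean of the targets (and for MAE it is the median). A tiny NumPy sketch, using a synthetic signal purely for illustration:

        import numpy as np

        y = np.random.randn(1000) * 2.0 + 5.0            # any 1-D target signal
        candidates = np.linspace(y.min(), y.max(), 501)  # constant predictions to try

        mse = [np.mean((y - c) ** 2) for c in candidates]
        best_constant = candidates[np.argmin(mse)]

        print(best_constant, y.mean())  # the best constant is (approximately) the mean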

    Also, a seq-to-seq model usually passes the prediction of a decoder step as the input to the next decoder step; in the current case, the decoder input is always the same repeated value.
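
    For contrast, here is a rough sketch of what such an autoregressive decoder could look like (hypothetical code, not part of the models above): each predicted step is fed back as the input to the next decoder step, instead of feeding the same repeated vector 1500 times.

        import tensorflow as tf
        from tensorflow.keras import layers

        cell = layers.LSTMCell(16)     # same number of units as the encoder
        project = layers.Dense(77)     # map the cell output back to 77 features

        def decode_autoregressively(encoded_state, start_step, steps=1500):
            state = encoded_state          # [h, c] from the encoder
            step_input = start_step        # e.g. the last input step, shape (batch, 77)
            outputs = []
            for _ in range(steps):
                out, state = cell(step_input, states=state)
                pred = project(out)        # predict the next step, shape (batch, 77)
                outputs.append(pred)
                step_input = pred          # feed the prediction back in
            return tf.stack(outputs, axis=1)   # (batch, steps, 77)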

    TL;DR: 1) a seq-to-seq model is not the best fit for this case; 2) due to the bottleneck, it cannot really learn to do anything better than predicting the mean value for each feature.