Tags: machine-learning, keras, lstm, autoencoder, anomaly-detection

Understand the output of an LSTM autoencoder and use it to detect outliers in a sequence


I am trying to build an LSTM model that takes a sequence of integers as input and outputs the probability of each integer appearing. If this probability is low, the integer should be considered an anomaly. I tried to follow this tutorial - https://towardsdatascience.com/lstm-autoencoder-for-extreme-rare-event-classification-in-keras-ce209a224cfb - which is where my model comes from. My input looks like this:

[[[3]
  [1]
  [2]
  [0]]

 [[3]
  [1]
  [2]
  [0]]

 [[3]
  [1]
  [2]
  [0]]]

However, I can't understand what I get as output.

[[[ 2.7052343 ]
  [ 1.0618575 ]
  [ 1.8257084 ]
  [-0.54579014]]

 [[ 2.9069736 ]
  [ 1.0850943 ]
  [ 1.9787762 ]
  [ 0.01915958]]

 [[ 2.9069736 ]
  [ 1.0850943 ]
  [ 1.9787762 ]
  [ 0.01915958]]]

Is it the reconstruction error? Or the probabilities for each integer? And if so, why are they not in the range 0-1? I would be grateful if someone could explain this.

The model:

from keras.models import Sequential
from keras.layers import LSTM, Dense, RepeatVector, TimeDistributed
from keras import optimizers

time_steps = 4
features = 1

# reshape the integer-encoded sequences to (samples, time_steps, features)
train_keys_reshaped = train_integer_encoded.reshape(91, time_steps, features)
test_keys_reshaped = test_integer_encoded.reshape(25, time_steps, features)

model = Sequential()
# encoder: compresses each sequence into a single 16-dimensional vector
model.add(LSTM(32, activation='relu', input_shape=(time_steps, features), return_sequences=True))
model.add(LSTM(16, activation='relu', return_sequences=False))
model.add(RepeatVector(time_steps))  # convert the 2D encoder output into the 3D input expected by the decoder
# decoder: reconstructs the original sequence from the encoded vector
model.add(LSTM(16, activation='relu', return_sequences=True))
model.add(LSTM(32, activation='relu', return_sequences=True))
model.add(TimeDistributed(Dense(features)))

adam = optimizers.Adam(0.0001)
model.compile(loss='mse', optimizer=adam)

# the autoencoder is trained to reproduce its own input
model_history = model.fit(train_keys_reshaped, train_keys_reshaped,
                          epochs=700,
                          validation_split=0.1)

# note: these are reconstructions of the input, not probabilities
predicted_probs = model.predict(test_keys_reshaped)

Solution

  • As you said, it's an autoencoder. Your autoencoder tries to reconstruct its input, so the output is a reconstruction of the raw integers, not a set of probabilities, which is why the values are not in the range 0-1. As you can see, the output values are very close to the input values; for example, the first reconstructed sequence [2.705, 1.062, 1.826, -0.546] differs from its input [3, 1, 2, 0] by a mean squared error of only about 0.10. So the autoencoder is well trained.

    Now, if you want to detect outliers in your data, you can compute the reconstruction error (for example, the mean squared error between the input and the output) for each sequence and set a threshold.

    If the reconstruction error is higher than the threshold, the sequence is an outlier, since the autoencoder was not trained to reconstruct outlier data (see the sketch below).

    This schema illustrates the idea: [image]
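
    Here is a minimal sketch of that thresholding step, reusing model, train_keys_reshaped and test_keys_reshaped from your code; the 95th-percentile threshold is an assumption to illustrate the idea, not a prescribed value:

    import numpy as np

    # reconstruction error per sequence: mean squared error over all
    # time steps and features
    train_recon = model.predict(train_keys_reshaped)
    train_errors = np.mean((train_keys_reshaped - train_recon) ** 2, axis=(1, 2))

    # pick a threshold from the training-error distribution; the 95th
    # percentile is an assumed starting point, tune it for your data
    threshold = np.percentile(train_errors, 95)

    # score the test sequences the same way and flag the outliers
    test_recon = model.predict(test_keys_reshaped)
    test_errors = np.mean((test_keys_reshaped - test_recon) ** 2, axis=(1, 2))
    outliers = test_errors > threshold
    print("Anomalous sequence indices:", np.where(outliers)[0])

    A lower percentile flags more sequences (more false alarms), a higher one flags fewer, so tune the threshold on data you trust to be normal.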

    I hope this helps ;)