sklearn likelihood from latent dirichlet allocation

I want to use the latent dirichlet allocation from sklearn for anomaly detection. I need to obtain the likelihood for a new samples as formally described in equation here.

How can I get that?

Solution

Solution to your problem

You should be using the score() method of the model which returns the log likelihood of the passed in documents.

Assuming you have created your documents as per the paper and trained an LDA model for each host. You should then get the lowest likelihood from all the training documents and use it as a threshold. Example untested code follows:

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Assuming X contains a host's training documents
# and X_unknown contains the test documents
lda = LatentDirichletAllocation(... parameters here ...)
lda.fit(X)
threshold = min([lda.score([x]) for x in X])
attacks = [
    i for i, x in enumerate(X_unknown)
    if lda.score([x]) < threshold
]

# attacks now contains the indexes of the anomalies

Exactly what you asked

If you want to use exact equation in the paper you linked I would advise against trying to do it in scikit-learn because the expectation step interface is not clear.

The parameters θ and φ can be found at lines 112 - 130 as doc_topic_d and norm_phi. The function _update_doc_distribution() returns the doc_topic_distribution and the sufficient statistics from which you could try to infer the θ and φ by the following again untested code:

theta = doc_topic_d / doc_topic_d.sum()
# see the variables exp_doc_topic_d in the source code
# in the function _update_doc_distribution()
phi = np.dot(exp_doc_topic_d, exp_topic_word_d) + EPS

Suggestion for another library

If you want to have more control over the expectation and maximization steps and the variational parameters I would suggest you look at LDA++ and specifically the EStepInterface (disclaimer I am one of the authors of LDA++).