scikit-learn  anomaly-detection

sklearn likelihood from latent dirichlet allocation


I want to use latent Dirichlet allocation from sklearn for anomaly detection. I need to obtain the likelihood of new samples, as formally described in the equation here.

How can I get that?


Solution

  • Solution to your problem

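    Use the model's score() method, which returns the approximate log-likelihood of the documents passed in. It returns a single value for everything it is given, so to get a per-document likelihood you have to score one document at a time.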

    This assumes you have created your documents as described in the paper and trained one LDA model per host. Take the lowest likelihood over all the training documents and use it as the threshold. Untested example code follows:

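    import numpy as np
    from sklearn.decomposition import LatentDirichletAllocation
    
    # Assuming X contains a host's training documents (a document-word
    # count matrix) and X_unknown contains the test documents
    lda = LatentDirichletAllocation(... parameters here ...)
    lda.fit(X)
    # Score each document on its own; X[i:i + 1] keeps the row 2-D,
    # which works for both dense arrays and sparse matrices
    threshold = min(lda.score(X[i:i + 1]) for i in range(X.shape[0]))
    attacks = [
        i for i in range(X_unknown.shape[0])
        if lda.score(X_unknown[i:i + 1]) < threshold
    ]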
    
    # attacks now contains the indexes of the anomalies
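
    For a version that can be run end to end, here is a minimal sketch of the same thresholding idea on synthetic data. The data generator (make_docs), the helper per_doc_scores, the vocabulary split and n_components=5 are all assumptions made up for illustration; they are not taken from the paper or from the question.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.RandomState(0)
n_words = 40


def make_docs(n_docs, active_words):
    """Synthetic count matrix whose documents only use `active_words`."""
    docs = np.zeros((n_docs, n_words), dtype=int)
    docs[:, active_words] = rng.poisson(2.0, size=(n_docs, len(active_words)))
    return docs


usual = np.arange(0, 20)     # vocabulary that appears in training
unusual = np.arange(20, 40)  # vocabulary that never appears in training

X = make_docs(40, usual)
# First 5 unknown documents look like training data, last 5 do not
X_unknown = np.vstack([make_docs(5, usual), make_docs(5, unusual)])

# n_components=5 is an arbitrary illustrative value, not a recommendation
lda = LatentDirichletAllocation(n_components=5, random_state=0)
lda.fit(X)


def per_doc_scores(model, docs):
    """Approximate log-likelihood of each document, scored one row at a time."""
    return np.array([model.score(docs[i:i + 1]) for i in range(docs.shape[0])])


train_scores = per_doc_scores(lda, X)
threshold = train_scores.min()

unknown_scores = per_doc_scores(lda, X_unknown)
attacks = np.flatnonzero(unknown_scores < threshold).tolist()

print("threshold:", threshold)
print("flagged indexes:", attacks)
```

    With the minimum training score as the threshold, no training document is ever flagged by construction. How many unknown documents are flagged depends on the data and on the model parameters.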
    

    Exactly what you asked

    If you want to use the exact equation from the paper you linked, I would advise against trying to do it in scikit-learn, because the interface to the expectation step is not clear.

    The parameters θ and φ can be found at lines 112 - 130 of the scikit-learn LDA source as doc_topic_d and norm_phi. The function _update_doc_distribution() returns the document-topic distribution and the sufficient statistics, from which you could try to infer θ and φ with the following, again untested, code:

    theta = doc_topic_d / doc_topic_d.sum()
    # exp_doc_topic_d and exp_topic_word_d are local variables of
    # _update_doc_distribution() in the source code; the quantity
    # below is the one named norm_phi there
    phi = np.dot(exp_doc_topic_d, exp_topic_word_d) + EPS
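
    If a plug-in point estimate is good enough, θ and φ can also be approximated through the public API instead of the private internals. The sketch below assumes synthetic count data and an arbitrary n_components=5; it takes θ from transform() and φ from the row-normalized components_, then evaluates the sum over words of n_w * log(sum_k θ_k φ_kw) for each document. This may not match the equation in the paper term for term, so treat it as an approximation.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.RandomState(0)
X = rng.poisson(1.0, size=(50, 30))  # synthetic document-word counts

# n_components=5 is an arbitrary illustrative value
lda = LatentDirichletAllocation(n_components=5, random_state=0).fit(X)

# theta: per-document topic proportions, shape (n_docs, n_topics)
theta = lda.transform(X)

# phi: per-topic word distributions, shape (n_topics, n_words),
# obtained by normalizing each row of components_
phi = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

# Probability of each word under each document's topic mixture
word_probs = theta @ phi  # shape (n_docs, n_words)

# Plug-in log-likelihood per document: sum_w n_w * log p(w | theta, phi)
log_lik = (X * np.log(word_probs)).sum(axis=1)

print(log_lik[:5])
```

    The resulting log_lik values can be thresholded in the same way as the output of score(), one value per document.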
    

    Suggestion for another library

    If you want more control over the expectation and maximization steps and over the variational parameters, I would suggest looking at LDA++, and specifically at its EStepInterface (disclaimer: I am one of the authors of LDA++).