Tags: python, scikit-learn, nlp, lda

Scoring strategy of sklearn.model_selection.GridSearchCV for LatentDirichletAllocation


I am trying to apply GridSearchCV to LatentDirichletAllocation using the sklearn library.

The current pipeline looks like this:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word',
                             min_df=10,                       # ignore terms in fewer than 10 documents
                             stop_words='english',            # remove English stop words
                             lowercase=True,                  # convert all words to lowercase
                             token_pattern='[a-zA-Z0-9]{3,}'  # keep alphanumeric tokens of 3+ characters
                            )

data_vectorized = vectorizer.fit_transform(doc_clean)  # doc_clean is the preprocessed text

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import GridSearchCV

lda_model = LatentDirichletAllocation(n_components=number_of_topics,
                                      max_iter=10,               # max learning iterations
                                      learning_method='online',
                                      random_state=100,          # for reproducibility
                                      batch_size=128,            # docs per learning iteration
                                      evaluate_every=-1,         # compute perplexity every n iters; -1: never
                                      n_jobs=-1,                 # use all available CPUs
                                      )

search_params = {'n_components': [10, 15, 20, 25, 30], 'learning_decay': [.5, .7, .9]}
model = GridSearchCV(lda_model, param_grid=search_params)
model.fit(data_vectorized)

Currently, GridSearchCV uses the approximate log-likelihood as the score to determine which model is best. What I would like to do is change the scoring method so that it is based on the model's approximate perplexity instead.

According to sklearn's documentation for GridSearchCV, there is a scoring argument that I can use. However, I do not know how to apply perplexity as a scoring method, and I cannot find any examples online of people applying it. Is this possible?


Solution

  • By default, GridSearchCV will use the score() function of the final estimator in the pipeline.

    make_scorer can be used here, but calculating perplexity also needs other data from the fitted model, which could be a little complex to provide through make_scorer. One alternative is to pass a plain callable as the scoring argument, as in the sketch below.
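    GridSearchCV's scoring parameter accepts any callable with the signature (estimator, X, y), and that callable receives the fitted estimator directly. A minimal sketch, assuming the grid search setup from the question (the name neg_perplexity_scorer is my own):

    def neg_perplexity_scorer(estimator, X, y=None):
        # GridSearchCV maximizes the score, and lower perplexity is
        # better, so return the negated perplexity
        return -estimator.perplexity(X)

    model = GridSearchCV(lda_model, param_grid=search_params,
                         scoring=neg_perplexity_scorer)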

    Alternatively, you can write a wrapper over your LDA in which you re-implement the score() function to return the perplexity. Something along these lines:

    class MyLDAWithPerplexityScorer(LatentDirichletAllocation):

        def score(self, X, y=None):
            # You can change the options passed to perplexity here
            score = super(MyLDAWithPerplexityScorer, self).perplexity(X, sub_sampling=False)

            # Perplexity is lower for better models, so negate it
            return -1 * score

    You can then use this class in place of LatentDirichletAllocation in your code, like:

    ...
    ...
    ...
    lda_model = MyLDAWithPerplexityScorer(n_components=number_of_topics,
                                          ....
                                          ....
                                          n_jobs=-1,
                                          )
    ...
    ...
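    The rest of the grid search is unchanged; GridSearchCV will now pick the parameter combination with the lowest perplexity (i.e. the highest negated perplexity). A brief usage sketch, assuming the vectorized data from the question:

    model = GridSearchCV(lda_model, param_grid=search_params)
    model.fit(data_vectorized)

    # Best hyperparameters and the (negated) perplexity they achieved
    print(model.best_params_)
    print(model.best_score_)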