Tags: machine-learning, lda, topic-modeling, grid-search, gridsearchcv

Integrate GridSearchCV with LDA Gensim


Data Source: Glassdoor reviews split into two dataframe columns, "Pros" & "Cons"

     - Pros refer to what the employees liked about the company
     - Cons refer to what the employees didn't like about the company

I have already done all the pre-processing: removing stopwords, punctuation, lowercasing, stemming and lemmatization, etc.

Questions:

1) I want to use the LDA topic modeling algorithm and tune it with GridSearchCV, which I heard finds the most optimal combination of parameters for your model. I used the Gensim library. I tried it with scikit-learn and it didn't work; it seems I would have to use the scikit-learn LDA to work with GridSearchCV (a minimal sketch of that scikit-learn route is shown right after these questions).

2) After finishing with LDA, since it's unsupervised learning, should I test my dataset with other topic modeling algorithms like NMF, LSA & HDP and do the same work with them, so that I can pick the best algorithm based on the best metrics for each one? (A sketch of such a comparison appears right after my code below.)

3) Is it enough to calculate and compare the coherence score and perplexity between the algorithms to choose the best one?
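
For reference on question 1, here is a minimal sketch of the scikit-learn route: GridSearchCV expects an estimator that follows the scikit-learn API, which Gensim's LdaModel does not provide, so it is paired with sklearn.decomposition.LatentDirichletAllocation instead. The joined-string documents (raw_pros), the vectorizer settings and the grid values are assumptions for illustration:

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# df['pros'] holds token lists, so join them back into strings for CountVectorizer
raw_pros = [" ".join(tokens) for tokens in df['pros']]

pipeline = Pipeline([
    # min_df / max_df mirror the filter_extremes settings used with Gensim below
    ('vectorizer', CountVectorizer(min_df=5, max_df=0.5)),
    ('lda', LatentDirichletAllocation(learning_method='online', random_state=0)),
])

param_grid = {
    'lda__n_components': [2, 3, 4, 5, 7, 10, 15, 20],
    'lda__learning_decay': [0.5, 0.7, 0.9],
}

# With no explicit scorer, GridSearchCV falls back to LatentDirichletAllocation.score,
# i.e. the approximate log-likelihood on the held-out fold (not coherence)
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(raw_pros)
print(search.best_params_)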

Code

import pandas as pd
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel
from sklearn.model_selection import ParameterGrid

# Create a dictionary of all the words in the "pros" text
pros_dictionary = Dictionary(df['pros'])
# Filter out rare and common words from the "pros" dictionary
pros_dictionary.filter_extremes(no_below=5, no_above=0.5)
# Create a bag-of-words representation of the "pros" text data
pros_corpus = [pros_dictionary.doc2bow(tokens) for tokens in df['pros']]

# Define the parameter grid for hyperparameter tuning
param_grid = {
    'num_topics': [2, 3, 4, 5, 7, 10, 15, 20],  # Possible values for the number of topics
    'passes': [5, 10, 15],  # Possible values for the number of passes
    'alpha': ['symmetric', 'asymmetric'],  # Possible values for alpha
    'eta': [0.01, 0.1, 1.0],  # Possible values for eta
    'iterations': [50, 100, 150, 200]  # Possible values for number of iterations
}
# Perform grid search with coherence score evaluation for "pros" text
best_coherence = -1
best_params = None

for params in ParameterGrid(param_grid):
    lda_model = LdaModel(corpus=pros_corpus, id2word=pros_dictionary, **params)
    coherence_model = CoherenceModel(model=lda_model, texts=df['pros'], dictionary=pros_dictionary, coherence='c_v')
    coherence = coherence_model.get_coherence()
    
    if coherence > best_coherence:
        best_coherence = coherence
        best_params = params

# Train the LDA model with the best hyperparameters for "pros" text
best_lda_model_pros = LdaModel(corpus=pros_corpus, id2word=pros_dictionary, **best_params)

# Print the topics and their top keywords for "pros" text
topics = best_lda_model_pros.show_topics(num_topics=best_params['num_topics'], num_words=5)
print("Topics for Pros:")
for topic in topics:
    print(f"Topic {topic[0]}: {topic[1]}")

# Assign the most dominant topic to each document in "pros" text
df['dominant_topic_pros'] = [max(best_lda_model_pros[doc], key=lambda x: x[1])[0] for doc in pros_corpus]

# Explore the dominant topics in the data for "pros" text
topic_counts_pros = df['dominant_topic_pros'].value_counts()
print("Dominant Topic Counts for Pros:")
print(topic_counts_pros)

print("Best LDA Model Parameters for Pros:")
print("Number of Topics:", best_lda_model_pros.num_topics)
print("Alpha:", best_lda_model_pros.alpha)
print("Eta:", best_lda_model_pros.eta)
print("Iterations:", best_lda_model_pros.iterations)
print("Passes:", best_lda_model_pros.passes)


# log_perplexity returns the per-word likelihood bound (log base 2); perplexity = 2 ** (-bound)
log_perplexity_pros = best_lda_model_pros.log_perplexity(pros_corpus)
perplexity_pros = 2 ** (-log_perplexity_pros)
# Total (base-2) log-likelihood bound = per-word bound * total token count
num_tokens_pros = sum(count for doc in pros_corpus for _, count in doc)
log_likelihood_pros = log_perplexity_pros * num_tokens_pros

# Calculate coherence score for Pros
coherence_model_pros = CoherenceModel(model=best_lda_model_pros, texts=df['pros'], dictionary=pros_dictionary, coherence='c_v')
coherence_score_pros = coherence_model_pros.get_coherence()

# Print the metrics for Pros
print("Metrics for Pros:")
print("Perplexity:", perplexity_pros)
print("Log-Likelihood:", log_likelihood_pros)
print("Coherence Score:", coherence_score_pros)

# Visualize the topics for Pros
pyLDAvis.enable_notebook()
lda_display_pros = gensimvis.prepare(best_lda_model_pros, pros_corpus, pros_dictionary, sort_topics=False)
pyLDAvis.display(lda_display_pros)
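
For reference on question 2, here is a minimal sketch of how the same dictionary, corpus and CoherenceModel from the code above could be reused with Gensim's other topic models (LsiModel, HdpModel and Nmf); the num_topics value of 10 is a placeholder assumption:

from gensim.models import LsiModel, HdpModel
from gensim.models.nmf import Nmf

# Reuse pros_corpus and pros_dictionary from above
candidate_models = {
    'LSA': LsiModel(corpus=pros_corpus, id2word=pros_dictionary, num_topics=10),
    'NMF': Nmf(corpus=pros_corpus, id2word=pros_dictionary, num_topics=10),
    'HDP': HdpModel(corpus=pros_corpus, id2word=pros_dictionary),  # HDP infers the number of topics
}

# Compare the models on the same c_v coherence used for LDA above
for name, model in candidate_models.items():
    cm = CoherenceModel(model=model, texts=df['pros'], dictionary=pros_dictionary, coherence='c_v')
    print(name, "c_v coherence:", cm.get_coherence())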

Solution

    1. I cannot really recognize a question here. Is your current implementation not working?

    2. The package OCTIS (Optimizing and Comparing Topic models Is Simple) is made specifically for this and could be useful; a rough sketch of its workflow is included at the end of this answer.

    3. Topic modeling metrics are somewhat debated at the moment. There is some research on finding a metric that describes how good a topic is. Coherence is traditionally the most used. However, the gold standards for topic quality are metrics decided by humans, more specifically word intrusion (showing a topic's top words plus one word that does not belong, and asking a human to pick out the intruder) and observed topic coherence (a human rating of each topic on a 3-point scale). A small sketch of the word-intrusion setup is included below.

    Depending on what the purpose of the model is, you could use a combination of metrics to decide the best model, or you could decide by manual inspection which model you deem the best.
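
    To make the word-intrusion idea concrete, here is a minimal illustrative sketch in plain Python (not a library API) that builds intrusion tasks from the best Gensim model in the question; the helper name, the choice of 5 top words, and drawing the intruder from another topic's top words are all assumptions for illustration:

    import random

    def word_intrusion_task(lda_model, topic_id, num_words=5, seed=None):
        """Build one word-intrusion task: the topic's top words plus one 'intruder'
        drawn from the top words of a different topic, shuffled for a human to inspect."""
        rng = random.Random(seed)
        top_words = [w for w, _ in lda_model.show_topic(topic_id, topn=num_words)]
        other_topics = [t for t in range(lda_model.num_topics) if t != topic_id]
        other_top = [w for w, _ in lda_model.show_topic(rng.choice(other_topics), topn=num_words)]
        candidates = [w for w in other_top if w not in top_words] or other_top
        intruder = rng.choice(candidates)
        task = top_words + [intruder]
        rng.shuffle(task)
        return task, intruder

    # One intrusion task per topic of the best "pros" model from the question
    for topic_id in range(best_lda_model_pros.num_topics):
        task, intruder = word_intrusion_task(best_lda_model_pros, topic_id, seed=topic_id)
        print(f"Topic {topic_id}: {task}  (intruder: {intruder})")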

    If you are interested, some papers:

    Machine Reading Tea Leaves: Automatically Evaluating Topic Coherence and Topic Model Quality

    Is Automated Topic Model Evaluation Broken? The Incoherence of Coherence
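
    Regarding point 2, here is a rough sketch of what the OCTIS workflow looks like, written from memory of the OCTIS README; treat the exact class names and call signatures as assumptions and verify them against the current OCTIS documentation:

    from skopt.space.space import Real

    from octis.dataset.dataset import Dataset
    from octis.models.LDA import LDA
    from octis.evaluation_metrics.coherence_metrics import Coherence
    from octis.optimization.optimizer import Optimizer

    # Load one of the bundled datasets (custom corpora can be loaded in OCTIS's folder format)
    dataset = Dataset()
    dataset.fetch_dataset("20NewsGroup")

    model = LDA(num_topics=10)
    coherence = Coherence(texts=dataset.get_corpus())

    # Bayesian optimization over the hyperparameters instead of an exhaustive grid
    search_space = {
        "alpha": Real(low=0.001, high=5.0),
        "eta": Real(low=0.001, high=5.0),
    }

    optimizer = Optimizer()
    result = optimizer.optimize(model, dataset, coherence, search_space,
                                number_of_call=30, model_runs=3,
                                save_path="octis_results/")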