python, nltk, gensim, n-gram

Capture bigram topics instead of unigrams using Latent Dirichlet Allocation


I am trying to achieve the same thing as in this question:

LDA Original Output

Uni-grams

    topic1 -scuba,water,vapor,diving

    topic2 -dioxide,plants,green,carbon

Required Output

Bi-gram topics

    topic1 -scuba diving,water vapor

    topic2 -green plants,carbon dioxide

That question has this answer:

    from nltk.util import ngrams

    # Appends the bigrams to each document, so the unigrams are kept as well
    for doc in docs:
        docs[doc] = docs[doc] + ["_".join(w) for w in ngrams(docs[doc], 2)]

What should I change so that the documents contain only bigrams?


Solution

  • Build each document from its bigrams only, instead of appending the bigrams to the original unigrams:

    from nltk.util import ngrams
    
    # Assumes docs is a dict mapping a document id to its list of tokens
    for doc in docs:
        docs[doc] = ["_".join(w) for w in ngrams(docs[doc], 2)]
    

    Or use the dedicated helper for bigrams:

    from nltk.util import bigrams
    
    for doc in docs:
        docs[doc] = ["_".join(w) for w in bigrams(docs[doc])]
    

    Then use these lists of bigrams as the texts for all subsequent operations, such as building the dictionary and training the LDA model.
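
To see what the bigram-only documents look like, here is a minimal sketch. It assumes docs is a dict mapping a document id to a list of tokens; the sample data and the helper name to_bigrams are made up for illustration. It pairs adjacent tokens with plain zip instead of NLTK, so it runs without any dependencies, and it produces the same adjacent pairs as nltk.util.bigrams.

```python
# Hypothetical sample data: a dict mapping document ids to token lists.
docs = {
    "doc1": ["scuba", "diving", "water", "vapor"],
    "doc2": ["green", "plants", "carbon", "dioxide"],
}


def to_bigrams(tokens):
    """Join each pair of adjacent tokens with '_' (illustrative helper)."""
    return ["_".join(pair) for pair in zip(tokens, tokens[1:])]


# Build a new dict so the original unigram documents stay untouched.
bigram_docs = {doc_id: to_bigrams(tokens) for doc_id, tokens in docs.items()}

print(bigram_docs["doc1"])  # ['scuba_diving', 'diving_water', 'water_vapor']
```

Note that every adjacent pair becomes a token, including pairs such as diving_water that span two phrases. The values of bigram_docs are ordinary lists of string tokens, so they can be passed on wherever the unigram token lists were used before.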