nlp gensim topic-modeling n-gram

Should bi-grams and tri-grams be used in LDA topic modeling?


I read several posts (here and here) online about LDA topic modeling. All of them use only uni-grams. Why are bi-grams and tri-grams not used for LDA topic modeling?


Solution

  • It's a matter of scale. If you have 1,000 types (i.e. "dictionary words"), you might end up, in the worst case (which is not going to happen in practice), with 1,000,000 bigrams and 1,000,000,000 trigrams. These numbers are hard to manage, especially since a realistic text will have far more types.

    The gains in accuracy/performance don't outweigh the computational cost here.
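To make the scale argument concrete, here is a minimal sketch in plain Python. The function names and the toy sentence are made up for illustration and are not part of gensim or any other library. It compares the worst-case number of n-gram types (vocabulary size raised to the power n) with the number of distinct n-grams that actually occur in a tiny token list.

```python
def worst_case_ngram_types(vocab_size: int, n: int) -> int:
    """Upper bound on distinct n-grams: every ordered sequence of n types."""
    return vocab_size ** n


def observed_ngram_types(tokens, n):
    """Number of distinct n-grams that actually occur in a token sequence."""
    # zip over n shifted copies of the list yields consecutive n-token windows
    ngrams = zip(*(tokens[i:] for i in range(n)))
    return len(set(ngrams))


# Toy corpus, invented for this example
tokens = "the cat sat on the mat and the cat ate the rat".split()
vocab_size = len(set(tokens))

for n in (1, 2, 3):
    bound = worst_case_ngram_types(vocab_size, n)
    seen = observed_ngram_types(tokens, n)
    print(f"n={n}: worst case {bound}, observed {seen}")

# The figures from the answer, for a 1,000-type vocabulary
print(worst_case_ngram_types(1000, 2), worst_case_ngram_types(1000, 3))
```

On real text the observed count stays far below the worst-case bound, but it still grows much faster than the unigram vocabulary, which is the cost the answer is pointing at.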