gensim, tf-idf, latent-semantic-indexing, latent-semantic-analysis

Which formula of tf-idf does the LSA model of gensim use?


There are many different ways in which tf and idf can be calculated. I want to know which formula is used by gensim in its LSA model. I have been going through its source code lsimodel.py, but it is not obvious to me where the document-term matrix is created (probably because of memory optimizations).

In one LSA paper, I read that each cell of the document-term matrix is the log-frequency of that word in that document, divided by the entropy of that word:

tf(w, d) = log(1 + frequency(w, d))
idf(w, D) = 1 / (-Σ_D p(w) log p(w))

However, this seems to be a very unusual formulation of tf-idf. A more familiar form of tf-idf is:

tf(w, d) = frequency(w, d)
idf(w, D) = log(|D| / |{d ∈ D: w ∈ d}|)
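
To make the difference between the two formulations concrete, here is a minimal pure-Python sketch of both. The toy corpus and the function names are hypothetical. For the entropy variant I am assuming that p(w) means the fraction of a word's total occurrences that fall in each document, which is my reading of the formula rather than something the paper is confirmed to say. A word that occurs in only one document has zero entropy, so the sketch returns infinity in that case.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    ["cat", "sat", "mat", "cat"],
    ["dog", "sat", "log"],
    ["cat", "dog", "cat", "dog"],
]
counts = [Counter(d) for d in docs]


def tf_log(word, i):
    """Log-frequency of `word` in document i: log(1 + frequency)."""
    return math.log(1 + counts[i][word])


def idf_entropy(word):
    """Inverse entropy of the word's distribution over documents.

    Assumption: p is the share of the word's occurrences in each document.
    """
    total = sum(c[word] for c in counts)
    probs = [c[word] / total for c in counts if c[word] > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    # Zero entropy (word confined to one document) would divide by zero.
    return 1 / entropy if entropy > 0 else float("inf")


def tf_raw(word, i):
    """Raw count of `word` in document i."""
    return counts[i][word]


def idf_log(word):
    """log(|D| / number of documents containing the word)."""
    df = sum(1 for c in counts if c[word] > 0)
    return math.log(len(counts) / df)


for word in ("cat", "mat"):
    print(word,
          "log-entropy:", tf_log(word, 0) * idf_entropy(word),
          "classic:", tf_raw(word, 0) * idf_log(word))
```

Both variants give more weight to words concentrated in few documents, but the entropy form also depends on how evenly a word's occurrences are spread, not just on how many documents contain it.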

I also noticed that there is a question on how TfidfModel itself is implemented in gensim. However, I did not see lsimodel.py importing TfidfModel, so I can only assume that lsimodel.py has its own implementation of tf-idf.


Solution

  • As I understand it, lsimodel.py does not perform the tf-idf encoding step itself. gensim's API documentation describes a dedicated tf-idf model (tfidfmodel.py), which can be used to encode a corpus that is then fed into the LSA model. From the tfidfmodel.py source code, it appears that the latter of the two tf-idf definitions you listed is followed.