python-3.x machine-learning nlp gensim word2vec

How to get tf-idf from w2v in gensim


I have a textual dataset on which I trained a gensim w2v model. Now I want to use those vectors to receive the tf-idf values for the words and documents in my data set. What is the right way to do it? I tried to follow the tutorial on gensim's site.

I expected something like models.tfidfmodel(model.wv[model.wv.index2word]) to work, but it fails with:

File "<ipython-input-229-7946418f8a82>", line 1, in <module>
    models.tfidfmodel(model.wv[model.wv.index2word])
TypeError: 'module' object is not callable

Does what I want make sense? Is BOW the only way to do that?


Solution

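  • In the tutorial you linked to, the model is given the corpus, i.e. the text (or transformed text) as a whole.

    What you tried to do instead is give the model the word vectors for the vocabulary that the w2v model learned. The TypeError itself has a separate cause: models.tfidfmodel is a module, while the class you need to call is models.TfidfModel.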

    If what you want is

    to receive the tf-idf values for the words and documents in my data set

    then simply pass the corpus itself, in bag-of-words form:

    tfidf = models.TfidfModel(corpus)
    

    If what you actually want is to run the TF-IDF model on a transformed corpus, first use your w2v model to transform the corpus, then pass the transformed corpus to TfidfModel.


    Note that because the TF-IDF model is computed only from word counts (how often a word occurs in a document and in how many documents it occurs), there is nothing to gain from giving it the transformed corpus instead of the original one.
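
    To make that last point concrete, here is a rough sketch of TF-IDF computed by hand. It assumes one common weighting, raw term count times log2(N / document frequency), without normalization; libraries differ in such details. The function name and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_by_hand(docs):
    """Return one {word: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for doc in docs for word in set(doc))
    weighted = []
    for doc in docs:
        counts = Counter(doc)  # term frequency within this document
        weighted.append(
            {w: c * math.log2(n_docs / df[w]) for w, c in counts.items()}
        )
    return weighted


docs = [["cat", "sat", "cat"], ["dog", "sat"], ["cat", "dog", "bird"]]
for weights in tfidf_by_hand(docs):
    print(weights)
```

    Only counts enter the calculation, so word vectors have no place to plug in. A word that occurs in every document gets weight zero under this weighting.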