
Is there a way to solve word analogies using the sklearn TF-IDF model?


I have fit a TF-IDF model on my own dataset using Python's sklearn library:

import sklearn.feature_extraction.text

tfidf_featuriser = sklearn.feature_extraction.text.TfidfVectorizer(stop_words=None)
tfidf_featuriser.fit(documents)  # documents is my corpus, a list of strings
tfidf_docterm_matrix = tfidf_featuriser.transform(documents)

I am trying to solve word analogies (man is to king as woman is to queen), as is possible with gensim's Word2Vec model. I have tried the following so far:

# Term vectors are rows of the transposed document-term matrix.
vec1 = tfidf_docterm_matrix.transpose()[tfidf_featuriser.vocabulary_['man'], :]
vec2 = tfidf_docterm_matrix.transpose()[tfidf_featuriser.vocabulary_['woman'], :]
vec3 = tfidf_docterm_matrix.transpose()[tfidf_featuriser.vocabulary_['king'], :]

vec4 = vec2 + vec3 - vec1

How can I retrieve the vectors most similar to vec4, in the hope that one of them is the word vector for "queen"?


Solution

  • tf-idf does not [attempt to] capture semantic information about individual words; it is a purely frequency-based model. As such, you shouldn't expect neat word analogies to pop up (think about it: why should the relative frequencies of 'man', 'woman', 'king' and 'queen' be so neatly related?).

    In a Word2Vec model, analogies such as queen ~= king + woman - man emerge in part because each word is represented as an n-dimensional vector that (hopefully) encodes its semantics (see the gensim sketch at the end of this answer).

    In a tf-idf matrix, on the other hand, each element of a word's vector is just a function of that word's frequency in a particular document. So the constraint you're placing is not only that the relative frequencies of these words be strongly correlated, but that the correlation hold at the level of individual documents, which is a big ask for a model that just counts word frequencies. (You can still rank terms by cosine similarity to vec4 mechanically; see the sketch below.)

    If you'd like to understand why word analogies emerge in word-embedding models like Word2Vec, I'd recommend having a look at this paper and the associated talk.
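
To answer the mechanical part of the question: you can still rank every term vector by cosine similarity to vec4, even though the top hits are unlikely to include 'queen' for the reasons above. A minimal sketch, assuming the vectoriser from the question (documents is a placeholder for your corpus, and get_feature_names_out assumes scikit-learn >= 1.0):

import numpy as np
import sklearn.feature_extraction.text
from sklearn.metrics.pairwise import cosine_similarity

documents = ["the king spoke to the man", "the queen spoke to the woman"]  # placeholder corpus

tfidf_featuriser = sklearn.feature_extraction.text.TfidfVectorizer(stop_words=None)
tfidf_docterm_matrix = tfidf_featuriser.fit_transform(documents)

# Term vectors are the rows of the transposed (term-document) matrix.
termdoc_matrix = tfidf_docterm_matrix.transpose()
vocab = tfidf_featuriser.vocabulary_

vec4 = (termdoc_matrix[vocab['woman'], :]
        + termdoc_matrix[vocab['king'], :]
        - termdoc_matrix[vocab['man'], :])

# Cosine similarity of every term vector against the analogy vector.
similarities = cosine_similarity(termdoc_matrix, vec4).ravel()
terms = tfidf_featuriser.get_feature_names_out()
for idx in np.argsort(similarities)[::-1][:5]:
    print(terms[idx], similarities[idx])

With a toy corpus like this, the ranking is driven entirely by which documents the words happen to co-occur in, which is exactly the point made above.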
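
For contrast, this is exactly the query that gensim's Word2Vec interface supports out of the box. A sketch using pretrained vectors from gensim.downloader (an illustrative choice, not something from the question; any trained KeyedVectors object behaves the same way):

import gensim.downloader

# Pretrained Google News embeddings (a large download on first use).
wv = gensim.downloader.load('word2vec-google-news-300')

# queen ~= king + woman - man
print(wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=3))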