I have a dataset of several thousand rows of text. My goal is to calculate tf-idf scores and then the cosine similarity between documents. This is what I did with gensim in Python, following the tutorial:
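from gensim import corpora, models, similarities
# dat is the dataset: an iterable of tokenized documents (lists of tokens)
dictionary = corpora.Dictionary(dat)
corpus = [dictionary.doc2bow(text) for text in dat]
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
index = similarities.MatrixSimilarity(corpus_tfidf)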
Let's say the tf-idf matrix and the similarity index are built. When a new document comes in, I want to query for its most similar document in the existing dataset.
Question: is there any way to update the tf-idf matrix so that I don't have to append the new document to the original dataset and recalculate the whole thing?
I'll post my solution since there are no other answers. Let's say we are in the following scenario:
import gensim
from gensim import models
from gensim import corpora
from gensim import similarities
from nltk.tokenize import word_tokenize
import pandas as pd
# Build the dictionary, the bag-of-words corpus, and the tf-idf model:
text = "I work on natural language processing and I want to figure out how does gensim work"
text2 = "I love computer science and I code in Python"
dat = pd.Series([text,text2])
dat = dat.apply(lambda x: str(x).lower())
dat = dat.apply(lambda x: word_tokenize(x))
dictionary = corpora.Dictionary(dat)
corpus = [dictionary.doc2bow(doc) for doc in dat]
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
# Query: transform a new document with the existing dictionary and model
query_text = "I love icecream and gensim"
query_text = query_text.lower()
query_text = word_tokenize(query_text)
vec_bow = dictionary.doc2bow(query_text)
vec_tfidf = tfidf[vec_bow]
If we look at the bag-of-words vector of the query:
print(vec_bow)
[(0, 1), (7, 1), (12, 1), (15, 1)]
and at its tf-idf vector:
print(tfidf[vec_bow])
[(12, 0.7071067811865475), (15, 0.7071067811865475)]
For reference, here is the mapping from ids to tokens:
print(dictionary.items())
[(0, u'and'),
(1, u'on'),
(8, u'processing'),
(3, u'natural'),
(4, u'figure'),
(5, u'language'),
(9, u'how'),
(7, u'i'),
(14, u'code'),
(19, u'in'),
(2, u'work'),
(16, u'python'),
(6, u'to'),
(10, u'does'),
(11, u'want'),
(17, u'science'),
(15, u'love'),
(18, u'computer'),
(12, u'gensim'),
(13, u'out')]
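To see where the two weights of 0.7071 come from, here is a minimal pure-Python sketch of the arithmetic. It assumes the weighting idf = log2(N / df) followed by L2 normalisation, which is what I understand gensim's default `TfidfModel` settings to be; the `tfidf_query` helper is hypothetical and not part of gensim. Plain `split()` stands in for `word_tokenize`, which gives the same tokens for these sentences.

```python
import math
from collections import Counter

# The two training documents from the scenario above, lowercased and tokenized.
docs = [
    "i work on natural language processing and i want to figure out how does gensim work".split(),
    "i love computer science and i code in python".split(),
]

# Document frequency: the number of documents each token appears in.
df = Counter(token for doc in docs for token in set(doc))
n_docs = len(docs)

# Assumed weighting: idf = log2(N / df). A token present in every document gets 0.
idf = {token: math.log2(n_docs / count) for token, count in df.items()}


def tfidf_query(tokens):
    """Hypothetical helper: weight a new document with the frozen idf table."""
    # Tokens missing from the idf table (never seen in training) are dropped.
    counts = Counter(t for t in tokens if t in idf)
    # Zero-weight terms are left out of the sparse result.
    weights = {t: c * idf[t] for t, c in counts.items() if idf[t] > 0}
    # L2-normalise so the vector has unit length.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


query = tfidf_query("i love icecream and gensim".split())
print(query)
```

Under these assumptions only 'love' and 'gensim' survive, each with weight 1/sqrt(2), which matches ids 15 and 12 in the output above.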
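It looks like the query only picks up terms that already exist in the dictionary and uses the pre-calculated weights to compute the tf-idf score: the unseen word 'icecream' is dropped entirely, and 'and' and 'i' get no weight because they appear in every document. So my workaround is to rebuild the model daily or weekly, since doing so is fast.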
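The rebuild workaround could be organised roughly as follows. This is only a sketch in plain Python, not gensim code: the `TfidfIndex` class and its methods are hypothetical names, and the weighting (idf = log2(N / df), L2 normalisation) is the same assumption as in the scenario above. Queries use the frozen weights, new documents wait in a buffer, and `rebuild()` recomputes everything when the scheduled job runs.

```python
import math
from collections import Counter


class TfidfIndex:
    """Hypothetical helper: frozen tf-idf weights plus a buffer of new documents."""

    def __init__(self, docs):
        self.docs = [list(doc) for doc in docs]
        self.pending = []
        self.rebuild()

    def add(self, tokens):
        """Queue a new document; it only affects the weights after rebuild()."""
        self.pending.append(list(tokens))

    def rebuild(self):
        """Fold pending documents in and recompute every weight from scratch."""
        self.docs.extend(self.pending)
        self.pending = []
        df = Counter(t for doc in self.docs for t in set(doc))
        # Assumed weighting: idf = log2(N / df).
        self.idf = {t: math.log2(len(self.docs) / c) for t, c in df.items()}
        self.vectors = [self._weigh(doc) for doc in self.docs]

    def _weigh(self, tokens):
        """Sparse, L2-normalised tf-idf vector; unknown tokens are dropped."""
        counts = Counter(t for t in tokens if t in self.idf)
        weights = {t: c * self.idf[t] for t, c in counts.items() if self.idf[t] > 0}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        return {t: w / norm for t, w in weights.items()} if norm else {}

    def most_similar(self, tokens):
        """Return (position, cosine score) of the closest stored document."""
        query = self._weigh(tokens)
        # Vectors are unit length, so the dot product is the cosine similarity.
        scores = [sum(w * vec.get(t, 0.0) for t, w in query.items())
                  for vec in self.vectors]
        best = max(range(len(scores)), key=scores.__getitem__)
        return best, scores[best]


index = TfidfIndex([
    "i work on natural language processing and i want to figure out how does gensim work".split(),
    "i love computer science and i code in python".split(),
])
new_doc = "i love icecream and gensim".split()

before = index.most_similar(new_doc)   # answered with the frozen weights
index.add(new_doc)
index.rebuild()                        # e.g. run from a daily or weekly job
after = index.most_similar(new_doc)    # the new document is now indexed too
print(before, after)
```

Before the rebuild the query matches the second document (position 1); after it, 'icecream' has a weight of its own and the new document is part of the index. With gensim itself, the equivalent of `rebuild()` is simply rerunning the dictionary, model, and index construction shown above on the enlarged dataset.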