python · nlp · tf-idf · gensim · cosine-similarity

Python tf-idf: fast way to update the tf-idf matrix


I have a dataset of several thousand rows of text. My goal is to compute the tf-idf scores and then the cosine similarity between documents. Following the gensim tutorial, this is what I did in Python:

dictionary = corpora.Dictionary(dat)
corpus = [dictionary.doc2bow(text) for text in dat]

tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
index = similarities.MatrixSimilarity(corpus_tfidf)

Let's say we have the tf-idf matrix and similarity index built. When a new document comes in, I want to query for its most similar document in the existing dataset.

Question: is there any way to update the tf-idf matrix so that I don't have to append the new document to the original dataset and recalculate the whole thing?


Solution

  • I'll post my solution since there are no other answers. Let's say we are in the following scenario:

    from gensim import models
    from gensim import corpora
    from gensim import similarities
    from nltk.tokenize import word_tokenize
    import pandas as pd
    
    # toy corpus: two short documents, lowercased and tokenized
    text = "I work on natural language processing and I want to figure out how does gensim work"
    text2 = "I love computer science and I code in Python"
    dat = pd.Series([text, text2])
    dat = dat.apply(lambda x: str(x).lower())
    dat = dat.apply(lambda x: word_tokenize(x))
    
    
    # build the dictionary, bag-of-words corpus, and tf-idf model
    dictionary = corpora.Dictionary(dat)
    corpus = [dictionary.doc2bow(doc) for doc in dat]
    tfidf = models.TfidfModel(corpus)
    corpus_tfidf = tfidf[corpus]
    
    
    # query: preprocess the new document the same way, then map it through the model
    query_text = "I love icecream and gensim"
    query_text = query_text.lower()
    query_text = word_tokenize(query_text)
    vec_bow = dictionary.doc2bow(query_text)
    vec_tfidf = tfidf[vec_bow]
    

    If we look at the bag-of-words vector:

    print(vec_bow)
    [(0, 1), (7, 1), (12, 1), (15, 1)]
    

    and its tf-idf weights:

    print(tfidf[vec_bow])
    [(12, 0.7071067811865475), (15, 0.7071067811865475)]
    
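    Given sparse (id, weight) vectors like the one above, the cosine similarity that gensim's `MatrixSimilarity` index computes for a query can be sketched in plain Python (a minimal illustration, not gensim's actual implementation):

    ```python
    from math import sqrt

    def cosine_sim(a, b):
        """Cosine similarity between two sparse tf-idf vectors,
        each given as a list of (term_id, weight) pairs."""
        da, db = dict(a), dict(b)
        dot = sum(w * db.get(i, 0.0) for i, w in da.items())
        norm_a = sqrt(sum(w * w for w in da.values()))
        norm_b = sqrt(sum(w * w for w in db.values()))
        if norm_a == 0.0 or norm_b == 0.0:
            return 0.0  # an all-zero vector matches nothing
        return dot / (norm_a * norm_b)

    # the query vector printed above: only 'gensim' (12) and 'love' (15) survive
    query = [(12, 0.7071067811865475), (15, 0.7071067811865475)]
    print(round(cosine_sim(query, query), 6))  # 1.0: a vector is identical to itself
    ```

    Querying the index with `index[vec_tfidf]` returns exactly these cosine scores, one per document in the original corpus, so finding the most similar existing document needs no rebuild at all.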

    For reference, here is the id-to-token mapping:

    print(dictionary.items())
    
    [(0, u'and'),
     (1, u'on'),
     (8, u'processing'),
     (3, u'natural'),
     (4, u'figure'),
     (5, u'language'),
     (9, u'how'),
     (7, u'i'),
     (14, u'code'),
     (19, u'in'),
     (2, u'work'),
     (16, u'python'),
     (6, u'to'),
     (10, u'does'),
     (11, u'want'),
     (17, u'science'),
     (15, u'love'),
     (18, u'computer'),
     (12, u'gensim'),
     (13, u'out')]
    

    It looks like the query only picks up terms that already exist in the dictionary and uses their pre-calculated weights to produce the tf-idf scores; the unseen term 'icecream' is silently dropped. So my workaround is to rebuild the model weekly or daily, since rebuilding is fast at this scale.
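    This behavior follows from how tf-idf is defined: a term that never appeared in the training corpus has no dictionary id and no document frequency, so it cannot receive a weight until the idf statistics are recomputed over the enlarged corpus. A pure-Python sketch of the idea (mirroring gensim's default log2 idf; the helper name `tfidf_vector` is illustrative, not part of gensim):

    ```python
    from math import log2, sqrt

    def tfidf_vector(doc_tokens, corpus):
        """tf-idf weights for doc_tokens against a fixed training corpus;
        tokens never seen in the corpus are silently dropped, just like
        gensim's doc2bow + TfidfModel pipeline."""
        n_docs = len(corpus)
        df = {}  # document frequency of each training term
        for doc in corpus:
            for term in set(doc):
                df[term] = df.get(term, 0) + 1
        weights = {}
        for term in doc_tokens:
            if term in df:                      # unseen terms ('icecream') contribute nothing
                idf = log2(n_docs / df[term])
                if idf > 0.0:                   # terms in every document get weight 0
                    weights[term] = weights.get(term, 0.0) + idf
        norm = sqrt(sum(w * w for w in weights.values()))
        return {t: w / norm for t, w in weights.items()} if norm else {}

    corpus = [
        "i work on natural language processing and i want to figure out how does gensim work".split(),
        "i love computer science and i code in python".split(),
    ]
    query = "i love icecream and gensim".split()
    print(sorted(tfidf_vector(query, corpus).items()))
    # [('gensim', 0.7071067811865475), ('love', 0.7071067811865475)]
    ```

    Running this on the two toy documents reproduces the 0.7071... weights printed above while dropping 'icecream', which is why appending new documents eventually requires refreshing the model: their unseen vocabulary and shifted document frequencies change the idf weights.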