Tags: python, scikit-learn, gensim, tf-idf, csr

Calculating similarity between Tfidf matrix and predicted vector causes memory overflow


I have generated a tf-idf model on ~20,000,000 documents using the following code, which works well. The problem is that when I try to calculate similarity scores using linear_kernel, the memory usage blows up:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

train_file = "docs.txt"
train_docs = DocReader(train_file) #DocReader is a generator for individual documents

vectorizer = TfidfVectorizer(stop_words='english', max_df=0.2, min_df=5)
X = vectorizer.fit_transform(train_docs)

#predicting a new vector, this works well when I check the predictions
indoc = "This is an example of a new doc to be predicted"
invec = vectorizer.transform([indoc])

#This is where the memory blows up
similarities = linear_kernel(invec, X).flatten()

It seems like this shouldn't take up much memory: comparing a 1-row CSR matrix against a 20-million-row CSR matrix should output a 1 x 20 million ndarray.
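
A quick back-of-the-envelope check of that output size (assuming a dense float64 result):

n_docs = 20_000_000
bytes_per_float = 8  # float64
print(n_docs * bytes_per_float / 1e9, "GB")  # ~0.16 GB for a 1 x 20M dense array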

Just FYI: X is a CSR matrix taking ~12 GB in memory (my computer only has 16 GB). I have tried looking into gensim as a replacement, but I can't find a good example.

Any thoughts on what I am missing?


Solution

  • You can do the processing in batches. Here is an example based on your code snippet, but with the dataset replaced by one from sklearn. For this smaller dataset, I also compute the similarities the original way to show that the results are equivalent. You can probably use a larger batch size.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import linear_kernel
    from sklearn.datasets import fetch_20newsgroups
    
    train_docs = fetch_20newsgroups(subset='train')
    
    vectorizer = TfidfVectorizer(stop_words='english', max_df=0.2, min_df=5)
    X = vectorizer.fit_transform(train_docs.data)
    
    #predicting a new vector, this works well when I check the predictions
    indoc = "This is an example of a new doc to be predicted"
    invec = vectorizer.transform([indoc])
    
    #Compute the similarities in batches so the memory usage stays bounded
    batchsize = 1024
    similarities = []
    for i in range(0, X.shape[0], batchsize):
        similarities.extend(linear_kernel(invec, X[i:min(i+batchsize, X.shape[0])]).flatten())
    similarities = np.array(similarities)
    similarities_orig = linear_kernel(invec, X)
    print((similarities == similarities_orig).all())
    

    Output:

    True
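
    With the full `similarities` array in hand for your 20M-document matrix, you will usually only care about the best matches. A minimal sketch (assuming you want the 10 most similar documents; k=10 is an arbitrary choice):

    # np.argpartition finds the indices of the k largest scores without
    # sorting all 20M entries, then we order just those k best-first.
    k = 10
    top_k = np.argpartition(similarities, -k)[-k:]
    top_k = top_k[np.argsort(similarities[top_k])[::-1]]
    for idx in top_k:
        print(idx, similarities[idx])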