I have generated a tf-idf model on ~20,000,000 documents using the following code, and the model itself works well. The problem is that when I try to calculate similarity scores with linear_kernel, memory usage blows up:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

train_file = "docs.txt"
train_docs = DocReader(train_file)  # DocReader is a generator for individual documents

vectorizer = TfidfVectorizer(stop_words='english', max_df=0.2, min_df=5)
X = vectorizer.fit_transform(train_docs)

# Transforming a new vector; this works well when I check the predictions
indoc = "This is an example of a new doc to be predicted"
invec = vectorizer.transform([indoc])

# This is where the memory blows up
similarities = linear_kernel(invec, X).flatten()
It seems like this shouldn't take much memory: comparing a 1-row CSR matrix against a 20M-row CSR matrix should output a 1 x 20M ndarray, which at float64 is only about 160 MB.
Just FYI: X is a CSR matrix taking ~12 GB in memory (my computer only has 16 GB). I have tried looking into gensim as a replacement but I couldn't find a great example.
Any thoughts on what I am missing?
You can do the processing in batches so that each kernel call only touches a slice of X, which keeps peak memory bounded. Here is an example based on your code snippet, but with the dataset replaced by one from sklearn. For this smaller dataset, I also compute the similarities the original way to show that the results are equivalent. You can probably use a larger batch size.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from sklearn.datasets import fetch_20newsgroups

train_docs = fetch_20newsgroups(subset='train')
vectorizer = TfidfVectorizer(stop_words='english', max_df=0.2, min_df=5)
X = vectorizer.fit_transform(train_docs.data)

# Transform a new document, as in your snippet
indoc = "This is an example of a new doc to be predicted"
invec = vectorizer.transform([indoc])

# Compute the similarities batch by batch so that each kernel call
# only involves a slice of X
batchsize = 1024
similarities = []
for i in range(0, X.shape[0], batchsize):
    similarities.extend(linear_kernel(invec, X[i:min(i + batchsize, X.shape[0])]).flatten())
similarities = np.array(similarities)

# Sanity check against the single-call version (feasible on this small dataset)
similarities_orig = linear_kernel(invec, X).flatten()
print((similarities == similarities_orig).all())
Output:
True
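If you want to avoid growing a Python list, a variant of the same loop can write each batch of scores directly into a preallocated array. This is just a sketch of the same batching idea, reusing X, invec, and batchsize from the snippet above:

import numpy as np

# Preallocate the full score vector (20M float64 values is only ~160 MB)
similarities = np.empty(X.shape[0], dtype=np.float64)
for i in range(0, X.shape[0], batchsize):
    j = min(i + batchsize, X.shape[0])
    # Each kernel call only materializes a 1 x batchsize dense result
    similarities[i:j] = linear_kernel(invec, X[i:j]).ravel()

Either way, the peak memory per iteration scales with the batch size rather than with the full matrix.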