MemoryError when computing TF-IDF cosine similarity in Python

I have a large dataset of item descriptions. It contains item IDs and a text description of each item. I want to build a cosine similarity matrix from the TF-IDF values of the terms in the descriptions.

My dataset contains descriptions for 300336 items. I get a MemoryError when I try to execute my Python code:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# df is assumed to be a pandas DataFrame with the descriptions in its `text` column
tf = TfidfVectorizer(analyzer='word',
                     ngram_range=(1, 1))
tfidf_mx = tf.fit_transform(df.text)
cosine_similarities = linear_kernel(tfidf_mx)

I also tried another way:

sim_mx = cosine_similarity(tfidf_mx, dense_output=False)

But it gives me a MemoryError too.

Maybe there's an upper limit on cosine similarity computation even for sparse matrices?

Do you know why the MemoryError occurs and how to deal with it?


  • The MemoryError occurs because the output is (a) ridiculously large and (b) dense, regardless of whether it's stored in a sparse or dense matrix.

    (a) If the input contains n items, there are n * (n - 1) similarities to compute and return. (Since sim(i, j) = sim(j, i), there are really just n * (n - 1) / 2 distinct similarities, but the matrix lists each one twice.) With 300336 items, the resulting matrix will contain about 90 billion entries. At 8 bytes per float64 entry, that's about 720 GB of space.

    (b) If most of these entries were 0, then a sparse matrix would save space. But that's often not the case with similarity scores. For example, cosine(i, j) is 0 only for pairs of items that have no words in common.
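
To make the arithmetic in (a) concrete, and to illustrate one common way around the problem, here is a minimal sketch. It is not part of the original answer: `dense_similarity_bytes` and `top_k_similar` are hypothetical helper names, and the approach (computing similarities in row batches and keeping only the k best matches per item) assumes you don't actually need the full matrix. It also assumes the matrix comes from TfidfVectorizer with its default norm='l2', so that a plain dot product of two rows equals their cosine similarity.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def dense_similarity_bytes(n_items, bytes_per_entry=8):
    """Bytes needed for a full n x n similarity matrix (float64 by default)."""
    return n_items * n_items * bytes_per_entry


def top_k_similar(tfidf_mx, k=5, batch_size=256):
    """Hypothetical helper: for every row, the indices and scores of its
    k most similar other rows, computed one batch of rows at a time.

    Assumes the rows are L2-normalized (TfidfVectorizer's default), so the
    dot product of two rows is their cosine similarity. Needs >= 2 rows.
    """
    n = tfidf_mx.shape[0]
    k = min(k, n - 1)
    top_idx = np.empty((n, k), dtype=np.int64)
    top_sim = np.empty((n, k), dtype=np.float64)
    for start in range(0, n, batch_size):
        stop = min(start + batch_size, n)
        # Only a (batch_size x n) block is ever held densely in memory.
        block = (tfidf_mx[start:stop] @ tfidf_mx.T).toarray()
        # Mask each item's similarity with itself so it is never selected.
        block[np.arange(stop - start), np.arange(start, stop)] = -1.0
        # Pick the k largest scores per row (unordered), then sort those k.
        part = np.argpartition(-block, k - 1, axis=1)[:, :k]
        part_sim = np.take_along_axis(block, part, axis=1)
        order = np.argsort(-part_sim, axis=1)
        top_idx[start:stop] = np.take_along_axis(part, order, axis=1)
        top_sim[start:stop] = np.take_along_axis(part_sim, order, axis=1)
    return top_idx, top_sim


# Size of the full matrix for the dataset in the question:
# about 7.2e11 bytes, i.e. roughly 720 GB.
full_bytes = dense_similarity_bytes(300336)

# Toy data standing in for df.text.
docs = ["red cotton shirt", "blue cotton shirt",
        "steel kitchen knife", "sharp steel knife"]
tfidf_mx = TfidfVectorizer().fit_transform(docs)
top_idx, top_sim = top_k_similar(tfidf_mx, k=1, batch_size=2)
```

The result takes n * k entries instead of n * n. The batch size trades speed for memory: with 300336 items, each batch of 256 rows is a dense block of about 615 MB, so it may need to be lowered on a small machine.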