Tags: python, out-of-memory, sparse-matrix, cosine-similarity

MemoryError with TF-IDF cosine similarity in Python


I have a large dataset of item descriptions: it contains item IDs and a text description for each item. I want to build a cosine similarity matrix from the tf-idf values of the terms in the descriptions.

My dataset contains descriptions for 300336 items, and I get a MemoryError when I try to execute my Python code:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity

# Build tf-idf vectors for the item descriptions.
tf = TfidfVectorizer(analyzer='word',
                     ngram_range=(1, 1),
                     min_df=0)
tfidf_mx = tf.fit_transform(df.text)

# Pairwise similarities between all items (this is where the MemoryError occurs).
cosine_similarities = linear_kernel(tfidf_mx)

I've also tried another way:

sim_mx = cosine_similarity(tfidf_mx, dense_output=False)

But it gives me a MemoryError too.

Maybe there's an upper limit even for sparse matrices in the cosine similarity computation?

Do you know why the MemoryError occurs and how to deal with it?


Solution

  • The MemoryError occurs because the output is (a) ridiculously large and (b) dense, regardless of whether it's stored in a sparse or dense matrix.

    (a) If the input contains n items, there are n * (n - 1) similarities to compute and return. (Since sim(i, j) = sim(j, i), there are really only n * (n - 1) / 2 distinct similarities, but the matrix lists each one twice.) With 300336 items, the resulting matrix contains roughly 90 billion entries. That's about 720 GB of space as 8-byte floats; the short Python check below spells out the arithmetic.

    (b) If most of these entries were 0, a sparse matrix would save space. But that's usually not the case with similarity scores: cosine(i, j) is 0 only for pairs of items that have no words in common.
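
As a quick sanity check on point (a), here is the arithmetic in Python, assuming each entry is stored as an 8-byte float64:

n_items = 300336
n_entries = n_items * (n_items - 1)       # off-diagonal similarities, each listed twice
bytes_per_entry = 8                       # float64
print(n_entries * bytes_per_entry / 1e9)  # roughly 721 GB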
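
One common workaround (not part of the answer above, just a sketch of the usual approach) is to avoid materializing the full matrix at all: if you only need, say, the k most similar items for each item, you can compute the similarities one block of rows at a time and keep only the best scores. A minimal sketch, reusing tfidf_mx from the question and a hypothetical block_size:

import numpy as np
from sklearn.metrics.pairwise import linear_kernel

def top_k_similar(tfidf_mx, k=10, block_size=1000):
    """Yield (item_index, indices of its k most similar items) without
    ever holding the full n x n similarity matrix in memory."""
    n = tfidf_mx.shape[0]
    for start in range(0, n, block_size):
        stop = min(start + block_size, n)
        # Similarities of one block of rows against all items: a dense
        # block of shape (block_size, n), which does fit in memory.
        block = linear_kernel(tfidf_mx[start:stop], tfidf_mx)
        for i, row in enumerate(block):
            row[start + i] = -1.0  # ignore the item's similarity to itself
            yield start + i, np.argsort(row)[-k:][::-1]

With block_size=1000, each dense block is about 1000 * 300336 * 8 bytes, or roughly 2.4 GB; shrink block_size further if that is still too much for your machine.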