python-3.x, scikit-learn, ubuntu-18.04, cosine-similarity, tfidfvectorizer

Memory Error when trying to compute Cosine Similarity Matrix on TFIDF vector


I am trying to build a movie-plot (content) based recommender function in Python 3 which takes a movie title as an argument and outputs the movies with the most similar plots.

My wrangled dataframe has shape (45466, 8); the column relevant here is overview, which holds each movie's plot summary.

[screenshot of the dataframe head]

I am using the fit_transform method of TfidfVectorizer from sklearn.feature_extraction.text to build the required TF-IDF matrix on the overview feature, like so:

from sklearn.feature_extraction.text import TfidfVectorizer

# Build the TF-IDF matrix over the plot overviews, dropping English stop words
tfidf = TfidfVectorizer(stop_words='english')

tfidf_matrix = tfidf.fit_transform(movies['overview'])

This results in a matrix of shape (45466, 75827) for the overviews of every movie, which means that, after removing common stop words, there are 75827 distinct words across the combined overviews of all 45466 movies.
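
As a quick sanity check (using the tfidf and tfidf_matrix names from above):

print(tfidf_matrix.shape)      # (45466, 75827)
print(len(tfidf.vocabulary_))  # 75827 distinct terms after stop-word removal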

Next, I want to compute the pairwise cosine similarity score of every movie based on the TF-IDF matrix constructed above. This should give me a 45466 x 45466 matrix where the (i, j)-th cell holds the similarity score between movies i and j. I am using the linear_kernel function from sklearn.metrics.pairwise to compute it; since TfidfVectorizer L2-normalises its output by default, the linear kernel is equivalent to cosine similarity here:

from sklearn.metrics.pairwise import linear_kernel

cos_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

This is where Python throws a MemoryError:

----
MemoryError                               Traceback (most recent call last)
<ipython-input-5-d884b8c29067> in <module>
      1 #STEP 2: COMPUTING THE COSINE SIMILARITY MATRIX---------------------------
----> 2 cosine_sim = linear_kernel(tv_mat, tv_mat)

~/.local/lib/python3.6/site-packages/sklearn/metrics/pairwise.py in linear_kernel(X, Y, dense_output)
    990     """
    991     X, Y = check_pairwise_arrays(X, Y)
--> 992     return safe_sparse_dot(X, Y.T, dense_output=dense_output)
    993 
    994 

~/.local/lib/python3.6/site-packages/sklearn/utils/extmath.py in safe_sparse_dot(a, b, dense_output)
    153     if (sparse.issparse(a) and sparse.issparse(b)
    154             and dense_output and hasattr(ret, "toarray")):
--> 155         return ret.toarray()
    156     return ret
    157 

~/.local/lib/python3.6/site-packages/scipy/sparse/compressed.py in toarray(self, order, out)
   1023         if out is None and order is None:
   1024             order = self._swap('cf')[0]
-> 1025         out = self._process_toarray_args(order, out)
   1026         if not (out.flags.c_contiguous or out.flags.f_contiguous):
   1027             raise ValueError('Output array must be C or F contiguous')

~/.local/lib/python3.6/site-packages/scipy/sparse/base.py in _process_toarray_args(self, order, out)
   1187             return out
   1188         else:
-> 1189             return np.zeros(self.shape, dtype=self.dtype, order=order)
   1190 
   1191 

MemoryError: Unable to allocate 15.4 GiB for an array with shape (45466, 45466) and data type float64
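
That allocation size checks out: a dense float64 matrix of shape (45466, 45466) needs n x n x 8 bytes.

n = 45466
print(n * n * 8 / 2**30)  # ~15.4 (GiB), far more than my 8G RAM + 1G swap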

I have 8G RAM and a 1G swap partition on a system running Ubuntu 18.04. How do I solve this problem? I can't upgrade the RAM soon enough.

  • I could perhaps try this with a much smaller dataset to begin with, but that isn't the solution I am looking for.
  • I could perhaps split tfidf_matrix in half, compute the cosine similarity of each half with itself and with the other half, and stitch the results back together (see the sketch after this list). Would that work?
  • Is there any simpler solution that I might be missing?
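
For reference, here is a minimal sketch of the blocked idea from the second bullet, refined to keep only the top matches per movie so the full dense matrix never has to exist. The function name top_similar, the block size of 1000 rows, and the top-10 cutoff are all illustrative choices, not code from my project:

import numpy as np
from sklearn.metrics.pairwise import linear_kernel

def top_similar(tfidf_matrix, n_top=10, block_size=1000):
    """Return, for each row, the indices of its n_top most similar rows,
    materialising the dense similarity matrix one block of rows at a time."""
    n = tfidf_matrix.shape[0]
    top_idx = np.zeros((n, n_top), dtype=np.int64)
    for start in range(0, n, block_size):
        stop = min(start + block_size, n)
        # Dense block of shape (block_size, 45466): ~0.34 GiB, not 15.4 GiB
        block = linear_kernel(tfidf_matrix[start:stop], tfidf_matrix)
        # Unordered top-(n_top + 1) per row, then sort and drop the self-match
        part = np.argpartition(block, -(n_top + 1), axis=1)[:, -(n_top + 1):]
        for i, row in enumerate(part):
            order = row[np.argsort(block[i, row])[::-1]]
            top_idx[start + i] = order[order != start + i][:n_top]
    return top_idx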

TIA!


Solution

  • IMO, the simplest solution was just increasing the swap space. I added a 15G swapfile using the following commands, in order:

    sudo fallocate -l 15G /swapfile   # preallocate a 15G file
    sudo chmod 600 /swapfile          # make it readable/writable by root only
    sudo mkswap /swapfile             # format the file as swap space
    sudo swapon /swapfile             # enable it immediately
    

    Although the computation ran slower than it would have in actual RAM, this did solve my problem. Find a detailed answer here.
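
    Note that swapon only lasts until reboot. To make the swapfile permanent (the standard Ubuntu step, though not strictly needed for a one-off computation), register it in /etc/fstab:

    echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab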