I am trying to build a movie plot (content) based recommender function in Python 3 that takes a movie title as an argument and outputs the movies with the most similar plots.
My wrangled data has shape (45466, 8). This is what the head of the wrangled data looks like:
I am using the fit_transform method of sklearn.feature_extraction.text's TfidfVectorizer to build the required TF-IDF matrix on the overview feature like so:
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(movies['overview'])
This results in a matrix of shape (45466, 75827), which means that, after removing common stop words, there are 75827 distinct words in the overviews of all 45466 movies combined.
After this, I want to compute the pairwise cosine similarity score for every pair of movies based on the TF-IDF matrix constructed above. This should give me a 45466 x 45466 matrix where cell (i, j) holds the similarity score between movies i and j. I am using sklearn.metrics.pairwise's linear_kernel function to compute it:
cos_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
This is where Python 3 raises a MemoryError:
----
MemoryError Traceback (most recent call last)
<ipython-input-5-d884b8c29067> in <module>
1 #STEP 2: COMPUTING THE COSINE SIMILARITY MATRIX---------------------------
----> 2 cosine_sim = linear_kernel(tv_mat, tv_mat)
~/.local/lib/python3.6/site-packages/sklearn/metrics/pairwise.py in linear_kernel(X, Y, dense_output)
990 """
991 X, Y = check_pairwise_arrays(X, Y)
--> 992 return safe_sparse_dot(X, Y.T, dense_output=dense_output)
993
994
~/.local/lib/python3.6/site-packages/sklearn/utils/extmath.py in safe_sparse_dot(a, b, dense_output)
153 if (sparse.issparse(a) and sparse.issparse(b)
154 and dense_output and hasattr(ret, "toarray")):
--> 155 return ret.toarray()
156 return ret
157
~/.local/lib/python3.6/site-packages/scipy/sparse/compressed.py in toarray(self, order, out)
1023 if out is None and order is None:
1024 order = self._swap('cf')[0]
-> 1025 out = self._process_toarray_args(order, out)
1026 if not (out.flags.c_contiguous or out.flags.f_contiguous):
1027 raise ValueError('Output array must be C or F contiguous')
~/.local/lib/python3.6/site-packages/scipy/sparse/base.py in _process_toarray_args(self, order, out)
1187 return out
1188 else:
-> 1189 return np.zeros(self.shape, dtype=self.dtype, order=order)
1190
1191
MemoryError: Unable to allocate 15.4 GiB for an array with shape (45466, 45466) and data type float64
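The size in the error message follows directly from the shape and dtype of the dense result. A quick arithmetic check (the variable names here are only illustrative):

```python
# Size of a dense (45466, 45466) float64 array, as requested in the traceback.
n_movies = 45466          # number of rows in the TF-IDF matrix
bytes_per_float64 = 8     # a float64 value occupies 8 bytes

dense_bytes = n_movies * n_movies * bytes_per_float64
dense_gib = dense_bytes / 2**30   # 1 GiB = 2**30 bytes

print(f"{dense_gib:.1f} GiB")     # 15.4 GiB, matching the MemoryError
```

That is well beyond 8 GB of RAM plus 1 GB of swap, which is why the allocation fails before any similarity is computed.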
I have 8 GB of RAM and a 1 GB swap partition on a system running Ubuntu 18.04. How do I solve this problem? I can't upgrade my RAM any time soon.
One idea I had was to split tfidf_matrix in half, compute the cosine similarity of each half with itself and with the other half, and put the pieces back together. Would that work? TIA!
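On the splitting idea: stitching the pieces back together would recreate the same 45466 x 45466 dense array, so the saving only appears if each block is reduced before the next one is computed. Below is a minimal sketch of that variant, assuming only the top-k neighbours of each movie are needed. The helper top_k_similar, its parameters, and the toy overviews are hypothetical and not from the original post. It relies on TfidfVectorizer L2-normalising rows by default, so linear_kernel on its output gives cosine similarity.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel


def top_k_similar(tfidf_matrix, k=10, block_size=1000):
    """Return an (n, k) array with the k most similar row indices per row.

    Hypothetical helper: only one (block_size, n) slab of similarities is
    materialised at a time, never the full (n, n) matrix.
    """
    n = tfidf_matrix.shape[0]
    k = min(k, n - 1)
    top = np.empty((n, k), dtype=np.int64)
    for start in range(0, n, block_size):
        stop = min(start + block_size, n)
        # Dense slab of shape (stop - start, n)
        sims = linear_kernel(tfidf_matrix[start:stop], tfidf_matrix)
        # Mask each row's similarity with itself so it is never recommended
        sims[np.arange(stop - start), np.arange(start, stop)] = -1.0
        # Column indices of the k highest scores in each row, best first
        top[start:stop] = np.argsort(-sims, axis=1)[:, :k]
    return top


# Toy stand-in for movies['overview'] (made-up text, for illustration only)
docs = [
    "a space crew fights an alien",
    "an alien hunts a space crew",
    "two friends open a bakery",
    "friends run a small bakery together",
]
toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(docs)
neighbours = top_k_similar(toy_matrix, k=1, block_size=2)
print(neighbours[:, 0])
```

With 45466 movies and block_size=1000, each float64 slab is roughly 350 MiB (temporary arrays created during sorting add to that), and the stored result is only 45466 x k integers. The block size is a tuning knob, not a recommended value.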
IMO, the simplest solution was just to increase the swap space. I added a 15G swapfile by running the following commands in order:
sudo fallocate -l 15G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
Although the computation ran slower than it would have on actual RAM, this did solve my problem. Find a detailed answer here
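If extra swap is not an option, another route that may be worth trying is to skip the full matrix and compute a single row of similarities when a title is requested. This is only a sketch under assumptions: the function recommend, the titles list, and the toy data are hypothetical, and the real DataFrame's title column is not shown in the question.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel


def recommend(title, titles, tfidf_matrix, k=10):
    """Return the k titles whose overviews are most similar to `title`.

    Hypothetical helper: computes one (1, n) row of similarities on demand.
    """
    idx = titles.index(title)  # raises ValueError for an unknown title
    # Slice with idx:idx + 1 so the query stays a 2-D (1, n_features) matrix
    scores = linear_kernel(tfidf_matrix[idx:idx + 1], tfidf_matrix).ravel()
    scores[idx] = -1.0         # never recommend the movie itself
    best = np.argsort(-scores)[:k]
    return [titles[i] for i in best]


# Made-up titles and overviews, for illustration only
titles = ["Alien Crew", "Space Hunt", "Bakery Friends", "Small Bakery"]
overviews = [
    "a space crew fights an alien",
    "an alien hunts a space crew",
    "two friends open a bakery",
    "friends run a small bakery together",
]
matrix = TfidfVectorizer(stop_words="english").fit_transform(overviews)
result = recommend("Alien Crew", titles, matrix, k=1)
print(result)
```

For 45466 movies each call produces 45466 float64 scores (about 355 KiB), at the cost of repeating one sparse matrix product per query.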