
MemoryError: Unable to allocate 7.74 TiB for an array with shape (287318, 3704243) and data type float64


I am working with a TF-IDF matrix tfidf_matrix of shape (287318, 3704243) that I am trying to reuse for later computation. Here is my full code:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA

tfidf_vectorizer = TfidfVectorizer()
# text shape is (287318,)
tfidf_matrix = tfidf_vectorizer.fit_transform(text)
X = tfidf_matrix.todense()  # MemoryError raised here

pca_num_components = 2
reduced_data = PCA(n_components=pca_num_components).fit_transform(X)

I am trying to reduce tfidf_matrix with PCA for plotting purposes, but I get a memory error at the line X = tfidf_matrix.todense() saying:

MemoryError: Unable to allocate 7.74 TiB for an array with shape (287318, 3704243) and data type float64

Is there any way to solve this problem?


Solution

  • A possible solution (although not perfect) is to randomly pick a fixed number of rows and perform PCA on that sample, as follows.

    import numpy as np

    # sample 3000 row indices uniformly at random, without replacement
    sample_idx = np.random.choice(tfidf_matrix.shape[0], size=3000, replace=False)
    X = tfidf_matrix[sample_idx, :].todense()
    reduced_data = PCA(n_components=2).fit_transform(X)
    

    We can change the size parameter as needed. An alternative that avoids sampling altogether is sketched below.
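
  • Alternatively, scikit-learn's TruncatedSVD accepts sparse input directly, so the dense 7.74 TiB array is never allocated and no sampling is required. Note that TruncatedSVD is not identical to PCA (it does not center the data), but it is a common choice for dimensionality reduction on TF-IDF matrices. A minimal sketch, assuming the tfidf_matrix from above (variable names are illustrative):

    from sklearn.decomposition import TruncatedSVD

    # fit_transform works on the sparse matrix itself -- no .todense() needed
    svd = TruncatedSVD(n_components=2, random_state=42)
    reduced_data = svd.fit_transform(tfidf_matrix)  # shape (287318, 2)

    As with PCA, reduced_data can then be scattered in 2D for plotting.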