numpy scikit-learn normalization sparse-matrix transformation

Take the logarithm of the values in a matrix in Compressed Sparse Row format (csr_matrix)

I would like to take the logarithm of count data that I obtained by running CountVectorizer on text data, and then test whether this transformation (normalization) improves the performance of a model in sklearn.

This is what I have:

TEXT = [data[i].values()[3] for i in range(len(data))]

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df=0.01,max_df = 2.5, lowercase = False, stop_words = 'english')

X = vectorizer.fit_transform(TEXT)
X = [math.log(i+1) for i in X]

As I run this code, however, I obtain an error:

File "nlpQ2.py", line 29, in <module>
X = [math.log(i+1) for i in X]
File "/opt/conda/lib/python2.7/site-packages/scipy/sparse/compressed.py", line 337, in __add__
raise NotImplementedError('adding a nonzero scalar to a '
NotImplementedError: adding a nonzero scalar to a sparse matrix is not supported

I did not really expect this to work, but I couldn't think of another way to take the logarithm of the values in a CSR matrix. I also tried

import math
import numpy as np
from scipy.sparse import csr_matrix

A = csr_matrix([[1, 2, 0], [0, 0, 3], [4, 0, 5]])

[math.log(i+1) for i in A]

This generates

NotImplementedError: adding a nonzero scalar to a sparse matrix is not supported

Is there a way to solve this? Thank you very much for your help.

Solution

You can convert the sparse matrix X to a dense matrix with the todense() method and then let NumPy compute the logarithm element-wise:

X = np.log(1 + X.todense())

If X is huge, converting it to a dense matrix may exhaust your RAM. In that case, use the log1p() method, which operates directly on sparse matrices. Because log(1 + 0) = 0, zero entries stay zero and the result remains sparse:

X = X.log1p()
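
As a quick check, here is a minimal sketch that applies both approaches to the small matrix A from the question and confirms they agree. The variable names (sparse_log, dense_log, B) are only illustrative, and the last variant, which transforms the stored values through the .data attribute, is an alternative I am adding rather than something from the code above:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Same toy matrix as in the question, stored as floats.
A = csr_matrix([[1, 2, 0], [0, 0, 3], [4, 0, 5]], dtype=np.float64)

# Sparse route: log1p() computes log(1 + x) on the stored entries only,
# so the result is still a sparse matrix.
sparse_log = A.log1p()

# Dense route: only reasonable when the matrix fits in memory.
dense_log = np.log(1 + A.toarray())

# Alternative sketch: transform the stored (nonzero) values directly.
B = A.copy()
B.data = np.log1p(B.data)

print(np.allclose(sparse_log.toarray(), dense_log))
print(np.allclose(B.toarray(), dense_log))
```

The .data variant only works for functions that map 0 to 0, since the implicit zeros are never touched.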