Tags: python, matrix, sparse-matrix, entropy

How do I calculate the column-wise information entropy of a large sparse probability matrix?


I have converted my corpus (2 million documents) into a Bag-of-Words sparse matrix using sklearn's CountVectorizer. The shape of the sparse matrix is around 2,000,000 x 170,000 (i.e., 170k words in the corpus vocabulary).

I'm inexperienced with sparse matrices, but I have managed to perform simple calculations on it, such as the variance of each word across the whole corpus, since that only involves simple mean and element-wise square operations on the matrix.
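
For example, the variance computation looks roughly like this (a sketch, not my exact code; X stands in for the CountVectorizer count matrix and the names are illustrative):

import numpy as np
from scipy.sparse import csr_matrix

# Column-wise variance via Var(X) = E[X^2] - E[X]^2, using only sparse-friendly operations.
# X is a toy stand-in for the (documents x vocabulary) count matrix.
X = csr_matrix([[2, 0, 1],
                [1, 3, 0],
                [0, 1, 1]], dtype=float)

col_mean = np.asarray(X.mean(axis=0)).ravel()                 # E[X] per word
col_mean_sq = np.asarray(X.multiply(X).mean(axis=0)).ravel()  # E[X^2] per word
col_variance = col_mean_sq - col_mean ** 2                    # population variance per word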

The issue I am having now is that I do not know how to efficiently calculate the column-wise entropy of the sparse matrix. Currently, I'm looping through each column and passing the word occurrence probabilities to scipy.stats.entropy, which takes very long due to the size of the sparse matrix.

An example for clarity:

from scipy.sparse import csr_matrix
from scipy.stats import entropy

# P: column-wise word probability sparse matrix (each column sums to 1),
# stored in the same CSR format that CountVectorizer produces
P = csr_matrix([[0.2, 0.0, 0.5, 0.3, 0.0, 0.0],
                [0.5, 0.5, 0.5, 0.6, 1.0, 0.0],
                [0.0, 0.0, 0.0, 0.1, 0.0, 0.5],
                [0.3, 0.5, 0.0, 0.0, 0.0, 0.5]])

# Current (slow) approach: loop over columns and feed each one to scipy.stats.entropy
entropy_list = []
for index in range(P.shape[1]):
    entropy_list.append(entropy(P[:, index].toarray().ravel()))

I expect to obtain an array of length 170,000, since I am calculating the entropy of each word in the corpus vocabulary. Timing my current code, it takes around 25 minutes to calculate the entropy of 10,000 words, so at this rate it will take about 7 hours to complete. Can anyone help me find a more efficient method?


Solution

  • Entropy: H(X) = -sum(p(x) * log(p(x)))

    import numpy as np

    logP = np.ma.log(P).filled(0)                          # log(p), masked where p == 0 and filled with 0
    entropy_list = -np.sum(np.multiply(P, logP), axis=0)   # entropy of each column (P dense)
    

    Note: scipy.stats.entropy normalizes its input, so if the columns don't sum to 1 it rescales them first; the formula above does not.
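
    As a quick sanity check (a sketch using a dense version of the example matrix from the question, whose columns already sum to 1), the vectorized result lines up with a per-column call to scipy.stats.entropy:

    import numpy as np
    from scipy.stats import entropy

    P = np.array([[0.2, 0.0, 0.5, 0.3, 0.0, 0.0],
                  [0.5, 0.5, 0.5, 0.6, 1.0, 0.0],
                  [0.0, 0.0, 0.0, 0.1, 0.0, 0.5],
                  [0.3, 0.5, 0.0, 0.0, 0.0, 0.5]])

    logP = np.ma.log(P).filled(0)                       # log(p), with 0 where p == 0
    vectorized = -np.sum(np.multiply(P, logP), axis=0)  # entropy of each column
    looped = np.array([entropy(P[:, j]) for j in range(P.shape[1])])
    print(np.allclose(vectorized, looped))              # should print True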

    Edit: For scipy.sparse.csr_matrix

    log_result = np.log(P.data)                  # log of the stored (nonzero) entries only
    logP = P._with_data(log_result, copy=True)   # same sparsity pattern, log-transformed data
    mult_P = P.multiply(logP)                    # element-wise p * log(p), still sparse
    entropy_list = -(mult_P.sum(axis=0))         # 1 x n_columns matrix of column entropies
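
    A rough end-to-end sketch of the same idea for the question's setting, illustrative only: X stands in for the CountVectorizer count matrix, every word is assumed to occur at least once (so the column sums are nonzero), and the counts are first column-normalized into probabilities before applying the recipe above.

    import numpy as np
    from scipy.sparse import csr_matrix, diags

    X = csr_matrix([[2, 0, 1],
                    [1, 3, 0],
                    [0, 1, 1]], dtype=float)       # toy document-term count matrix

    col_sums = np.asarray(X.sum(axis=0)).ravel()   # total count of each word
    P = X.dot(diags(1.0 / col_sums)).tocsr()       # scale each column so it sums to 1

    logP = P._with_data(np.log(P.data), copy=True) # log of stored entries only
    entropy_per_word = -np.asarray(P.multiply(logP).sum(axis=0)).ravel()
    print(entropy_per_word)                        # one entropy value per vocabulary word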