PPMI of scipy sparse matrix

I need to transform a scipy sparse matrix (csr format) into a PPMI-weighted matrix. I have a square sparse co-occurrence matrix whose rows and columns both correspond to words, and each entry mat(i, j) is the number of times words i and j were found together in a corpus.

Here is a minimal example of how to obtain this matrix:

from sklearn.feature_extraction.text import CountVectorizer

sentences = ["The cat is on the table",
             "I have seen a cat in the office",
             "You shall feed the cat before it gets dark",
             "I have many pets in my house, but my favourite is my cat",
             "Dogs are nice, but cats are far nicer in my opinion"]
count_model = CountVectorizer(ngram_range=(1,1))
X = count_model.fit_transform(sentences) # word-by-context matrix
Xc = (X.T * X) # word-by-word co-occurrence matrix in sparse csr format
Xc.setdiag(0)

What I want is to convert each matrix cell into the PPMI of that value:

PPMI(i, j) = max(log2[P(i, j) / (P(i) * P(j))], 0)
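
For concreteness, here is a small sketch of that formula applied to made-up probabilities. The helper name ppmi_from_probs and the numbers are only illustrative; they do not come from my corpus or from any library.

```python
import math

def ppmi_from_probs(p_ij, p_i, p_j):
    """PPMI for one cell, given the joint and marginal probabilities."""
    if p_ij == 0:
        # log2(0) is undefined; PMI tends to -inf, so PPMI clips it to 0
        return 0.0
    return max(math.log2(p_ij / (p_i * p_j)), 0.0)

# Words seen together twice as often as chance predicts: PMI = log2(2) = 1
print(ppmi_from_probs(0.125, 0.25, 0.25))
# Words seen together half as often as chance predicts: PMI = -1, clipped to 0
print(ppmi_from_probs(0.03125, 0.25, 0.25))
```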

Right now I have this very slow function that computes the PPMI value for a single (i, j) pair. Is there a more efficient solution? This one does not scale to the whole matrix (the toy matrix I posted is 29x29, but my real matrix is 65,000x65,000).

import numpy as np

def ppmi(matrix, idx1, idx2):
    tot = matrix.count_nonzero() # normaliser: number of nonzero cells
    p_a = sum(matrix[idx1, :].toarray()[0])/tot # probability of first element
    p_b = sum(matrix[idx2, :].toarray()[0])/tot # probability of second element
    p_ab = matrix[idx1, idx2]/tot # probability of co-occurrence
    ppmi = max([np.log2(p_ab/(p_a*p_b)), 0]) # log2(0) gives -inf, clipped to 0
    return ppmi

Thank you!

Solution

Let's try a whole-array version of your ppmi:

def foo(matrix):
    tot = matrix.count_nonzero()
    p_a = matrix.sum(axis=1).A1/tot  # (n,) array
    pouter = p_a[:,None]*p_a
    p_ab = matrix/tot # probability of co-occurrence
    ppmi = np.log2(p_ab/(pouter))
    ppmi = np.maximum(ppmi, 0)
    return ppmi

Your sample matrix:

In [72]: Xc
Out[72]: 
<29x29 sparse matrix of type '<class 'numpy.int64'>'
    with 313 stored elements in Compressed Sparse Column format>

In [73]: M=foo(Xc)
C:\Users\paul\AppData\Local\Temp\ipykernel_2272\2466548606.py:6: RuntimeWarning: divide by zero encountered in log2
  ppmi = np.log2(p_ab/(pouter))

In [74]: M.shape
Out[74]: (29, 29)

In contrast, iterating over all indices:

In [75]: res = np.array([[ppmi(Xc,i,j) for j in range(29)] for i in range(29)])
C:\Users\paul\AppData\Local\Temp\ipykernel_2272\132281081.py:6: RuntimeWarning: divide by zero encountered in log2
  ppmi = max([np.log2(p_ab/(p_a*p_b)), 0])

In [76]: res.shape
Out[76]: (29, 29)

The values match (though M is an np.matrix rather than an ndarray):

In [77]: np.allclose(res,M)
Out[77]: True

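Without the np.maximum, the foo result has a lot of -inf values (the log2 of the zero entries). Also, the result is dense: pouter is built as a dense (n, n) array, and M comes back as a dense np.matrix, though I'm not sure at exactly which step the sparse matrix gets converted. That will pose problems for larger cases; at 65,000 x 65,000, a single dense float64 array takes about 34 GB.

Anyway, the timings: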

In [78]: timeit res = np.array([[ppmi(Xc,i,j) for j in range(29)] for i in range(29)])
C:\Users\paul\AppData\Local\Temp\ipykernel_2272\132281081.py:6: RuntimeWarning: divide by zero encountered in log2
  ppmi = max([np.log2(p_ab/(p_a*p_b)), 0])
452 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [79]: timeit M = foo(Xc)
C:\Users\paul\AppData\Local\Temp\ipykernel_2272\2466548606.py:6: RuntimeWarning: divide by zero encountered in log2
  ppmi = np.log2(p_ab/(pouter))
438 µs ± 29.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

This needs more work, but it shows that many of the single-element calculations in ppmi can be performed for the whole array at once.
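
One possible direction for that further work, offered as a sketch rather than something I have run on a full-size matrix: compute the PMI only for the stored nonzero entries, so that nothing dense of shape (n, n) is ever built. The name ppmi_sparse is made up for illustration. It keeps the same normaliser as foo (the number of nonzero cells) and assumes a symmetric matrix of nonnegative counts, as in your example.

```python
import numpy as np
from scipy import sparse

def ppmi_sparse(matrix):
    """PPMI evaluated only on the nonzero cells; returns a sparse CSR matrix."""
    # Work on a float COO copy so the caller's matrix is left untouched
    coo = matrix.tocoo(copy=True).astype(np.float64)
    coo.sum_duplicates()
    # Drop explicitly stored zeros (e.g. left by setdiag(0)); log2(0) is -inf
    coo.eliminate_zeros()

    tot = coo.nnz                                    # same value as count_nonzero()
    p_a = np.asarray(coo.sum(axis=1)).ravel() / tot  # (n,) row probabilities
    p_ab = coo.data / tot                            # one value per nonzero cell

    # Assumption: the matrix is symmetric, so row sums also serve for columns
    pmi = np.log2(p_ab / (p_a[coo.row] * p_a[coo.col]))

    # Cells with PMI <= 0 become implicit zeros of the sparse result
    keep = pmi > 0
    out = sparse.coo_matrix(
        (pmi[keep], (coo.row[keep], coo.col[keep])), shape=coo.shape
    )
    return out.tocsr()
```

Calling ppmi_sparse(Xc).toarray() on the small example should be directly comparable to foo(Xc). The zero cells never reach log2, so the divide-by-zero warning should not appear, and memory use grows with the number of nonzero entries instead of with n squared.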