Tags: python, numpy, pytorch, sparse-matrix

How to efficiently calculate pairwise intersection of nonzero indices in a scipy.csr sparse matrix?


I have a scipy.sparse.csr matrix X which is n x p. For each row in X I would like to compute the intersection of its nonzero element indices with those of every other row in X, and store them in a new tensor or maybe even a dictionary. For example, X is:

X = [
[0., 1.5, 4.7],
[4., 0., 0.],
[0., 0., 2.6]
]

I would like the output to be

intersect = 
[
[[1,2], [], [2]],
[[], [0], []],
[[2], [], [2]]
]

intersect[i,j] is an ndarray representing the intersection of the indices of the nonzero elements of the ith and jth rows of X, i.e. X[i] and X[j].

Currently I am doing this with a loop, and I would like to vectorize it, as that would be much faster and the computations could be done in parallel.

# current code
import numpy as np

n = X.shape[0]
intersection_dict = {}
for i in range(n):
    for j in range(n):
        # Shared nonzero column indices of rows i and j.
        indices = np.intersect1d(X[i].indices, X[j].indices)
        intersection_dict[(i, j)] = indices

My n is pretty large, so looping over n^2 pairs is very slow. I am just having trouble figuring out a way to vectorize this operation. Does anybody have any ideas on how to tackle this?
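
One partial vectorization worth noting: while the intersections themselves are ragged, the intersection sizes can be computed in a single step by multiplying the binary nonzero pattern of X by its transpose, which at least restricts the pair loop to rows that actually overlap. A minimal sketch, assuming X is the CSR matrix from above (variable names are mine):

import numpy as np
from scipy import sparse

# X is the n x p CSR matrix from above.
# Binary pattern of X: entry (i, k) is 1 where X[i, k] != 0.
B = (X != 0).astype(np.int64)

# counts[i, j] = number of nonzero indices shared by rows i and j,
# i.e. len(intersect[i, j]); computed entirely in scipy's C code.
counts = B @ B.T

# Only pairs with counts[i, j] > 0 have a nonempty intersection, so
# the O(n^2) Python loop can be restricted to the nonzero entries
# of counts (counts is itself sparse when overlap is rare).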

EDIT: It was made apparent that I should explain the problem I am trying to solve, so here it is.

I am solving an optimization problem with an equation of the form W = X diag(theta) X'. I want to compute W quickly as I update the entries of theta until convergence. Furthermore, I am updating the parameters with PyTorch, where sparse operations are not as extensive as in scipy.

where:

X : n x p sparse data matrix (n documents, p features)
theta : p x 1 parameter vector I want to learn and will be updating
X' : p x n transpose of sparse data matrix

note p >> n
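
For reference, in scipy this product can be formed without materializing diag(theta), by scaling the columns of X by theta and then multiplying by X'. A minimal sketch (shapes chosen for illustration, names are mine):

import numpy as np
from scipy import sparse

n, p = 100, 10000
X = sparse.random(n, p, format="csr", density=0.01)
theta = np.random.rand(p)

# X diag(theta) X': multiply() broadcasts theta across the rows of X,
# scaling column k by theta[k]; the result is an n x n sparse matrix.
W = X.multiply(theta) @ X.T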

I had in mind two methods of solving this quickly:

  1. Cache the sparse outer products of the columns of X (see More efficient matrix multiplication with diagonal matrix).
  2. W_ij = X_i * theta * X_j (the elementwise product of row i of X, theta, and row j of X, summed). Since X_i and X_j are sparse, I was thinking that if I take the intersection of the nonzero indices, then I can do a simple dense elementwise product (sparse elementwise product is not supported in pytorch) of X_i[intersection indices] * theta[intersection indices] * X_j[intersection indices]; see the sketch after this list.
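
A minimal sketch of method 2 for a single pair (i, j), in numpy/scipy terms, assuming X is CSR and theta is a dense array (the helper name is hypothetical):

import numpy as np

def pair_weight(X, theta, i, j):
    # W_ij = sum over shared nonzero indices k of X[i,k]*theta[k]*X[j,k].
    common, ia, ja = np.intersect1d(
        X[i].indices, X[j].indices, return_indices=True
    )
    if common.size == 0:
        return 0.0
    # Dense elementwise product restricted to the intersection.
    return np.sum(X[i].data[ia] * theta[common] * X[j].data[ja])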

I want to vectorize as much of this computation as possible rather than loop, as my n is typically in the thousands and p is 11 million.

I am attempting method 2 over method 1 due to the lack of sparse support in PyTorch. Mainly, when updating the entries of theta, I would not like to do sparse-dense or sparse-sparse operations; I want to do dense-dense operations.
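
Concretely, once the intersection indices and the corresponding row values are cached as dense tensors (one set per pair), each theta update only touches dense operations. A minimal sketch in PyTorch (names are mine):

import torch

# Cached once per pair (i, j), from the scipy side:
#   idx    : LongTensor of shared nonzero column indices
#   xi, xj : dense tensors of X[i] and X[j] values at those indices
def pair_weight_torch(xi, xj, idx, theta):
    # Dense-dense only: gather theta at the cached indices and reduce.
    return (xi * theta[idx] * xj).sum()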


Solution

  • The optimization you're looking at requires storing p different n x n matrices. If you do want to try it, I'd probably use the functionality built into scipy's sparse matrices, which is implemented in C extensions.

    import numpy as np
    from scipy import sparse
    
    arr = sparse.random(100, 10000, format="csr", density=0.01)
    
    # Cache the full product X X' and the p outer products of the
    # columns of X (each one is an n x n sparse matrix).
    xxt = arr @ arr.T
    p_comps = [arr[:, i] @ arr.T[i, :] for i in range(arr.shape[1])]
    
    def calc_weights(xxt, thetas, p_comps):
        # Reuse the sparsity structure of X X' with zeroed values,
        # then accumulate theta[i] * (outer product of column i).
        xxt = xxt.copy()
        xxt.data = np.zeros(xxt.data.shape, dtype=xxt.dtype)
        for i, t in enumerate(thetas):
            xxt += (p_comps[i] * t)
        return xxt
    
    W = calc_weights(xxt, np.ones(10000), p_comps)
    
    # With theta = ones, W should equal X X' exactly:
    >>> (xxt.A == W.A).all()
    True
    

    It's really unlikely that this is going to work well implemented in pure Python. You may have better luck doing this in C, or by writing something with nested loops that operates on the CSR elements directly and is amenable to JIT compilation with Numba.
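
    For illustration, a sketch of that Numba approach: a JIT-compiled merge over the sorted per-row CSR index arrays (scipy keeps row indices sorted by default; all names here are mine, not a fixed API):

    import numba
    import numpy as np

    @numba.njit
    def calc_weights_numba(indptr, indices, data, theta, n):
        # W[i, j] = sum_k theta[k] * X[i, k] * X[j, k], via a merge
        # over the sorted nonzero column indices of rows i and j.
        W = np.zeros((n, n))
        for i in range(n):
            for j in range(i, n):
                a, b = indptr[i], indptr[j]
                acc = 0.0
                while a < indptr[i + 1] and b < indptr[j + 1]:
                    if indices[a] == indices[b]:
                        acc += theta[indices[a]] * data[a] * data[b]
                        a += 1
                        b += 1
                    elif indices[a] < indices[b]:
                        a += 1
                    else:
                        b += 1
                W[i, j] = acc
                W[j, i] = acc
        return W

    # Usage: W = calc_weights_numba(X.indptr, X.indices, X.data, theta, X.shape[0])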