
Efficiently slice triangular sparse matrix


I have a sparse, triangular matrix (e.g. a distance matrix). In reality this would be a > 1M x 1M distance matrix with high sparsity.

import numpy as np
from scipy.sparse import csr_matrix

X = csr_matrix([
      [1, 2, 3, 3, 1],
      [0, 1, 3, 3, 2],
      [0, 0, 1, 1, 3],
      [0, 0, 0, 1, 3],
      [0, 0, 0, 0, 1],
])

I want to subset this matrix to another triangular distance matrix. The indexes may be ordered differently and/or duplicated.

idx = np.matrix([1,2,4,2])
X2 = X[idx.T, idx]

The result may no longer be triangular: some values are missing from the upper triangle, and some values are duplicated in the lower triangle.

>>> X2.toarray()
array([[1, 3, 2, 3],
       [0, 1, 3, 1],
       [0, 0, 1, 0],
       [0, 1, 3, 1]])

How can I get the correct upper triangle matrix as efficiently as possible? Currently, I mirror the matrix before subsetting, and subset it to the triangle afterwards, but this doesn't feel particularly efficient, as it requires, at least, duplication of all entries.

# use the transpose method, see https://stackoverflow.com/a/58806735/2340703
import scipy.sparse

X = X + X.T - scipy.sparse.diags(X.diagonal())
X2 = X[idx.T, idx]
X2 = scipy.sparse.triu(X2, k=0, format="csr")
>>> X2.toarray()
array([[1., 3., 2., 3.],
       [0., 1., 3., 1.],
       [0., 0., 1., 3.],
       [0., 0., 0., 1.]])

Solution

  • To summarize all the excellent contributions, the short answer to this question is:

    Don't use triangular matrices. There's nothing to gain in terms of speed or memory compared to using square matrices.

    The reason for this is explained in @hpaulj's answer:

    • Slicing on sparse matrices is implemented via matrix multiplication, which is highly efficient. Rearranging the result back into a triangular shape will be slow.
    • Using triu is an expensive operation since it materializes a dense mask.

    This becomes evident when comparing @jakevdp's solution with just using a square matrix. Using the square form is faster and uses less memory.
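    As a minimal sketch of the recommended square-form workflow (using the toy matrix from the question): symmetrize once up front, then every subset is a plain slice with no `triu` cleanup step.

```python
import numpy as np
import scipy.sparse

# Toy upper-triangular distance matrix from the question
X = scipy.sparse.csr_matrix([
    [1, 2, 3, 3, 1],
    [0, 1, 3, 3, 2],
    [0, 0, 1, 1, 3],
    [0, 0, 0, 1, 3],
    [0, 0, 0, 0, 1],
])

# Symmetrize once, up front (diagonal is subtracted so it isn't doubled)
X_sq = X + X.T - scipy.sparse.diags(X.diagonal())

# Every subsequent subset is then a plain slice; the result is again
# a symmetric square distance matrix, ready for the next subset.
idx = np.array([1, 2, 4, 2])
X2 = X_sq[idx][:, idx]
```

    The upper triangle of `X2` matches the output of the mirror-and-`triu` workaround above, but no per-subset symmetrization or masking is needed.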

    The test uses a sparse triangular 800k x 800k distance matrix with high sparsity (fraction of nonzero entries well below 1%). Data and code are available here.

    # Running benchmark: Converting to square matrix
    ./benchmark.py squarify   6.29s  user 1.59s system 80% cpu 9.738 total
    max memory:                4409 MB
    
    # Running benchmark: @jakevdp's solution
    ./benchmark.py sparse_triangular   67.03s  user 3.01s system 99% cpu 1:10.15 total
    max memory:                5209 MB
    

    If one desperately wants to optimize this beyond using square matrices, @CJR's comment is a good starting point:

    I would consider implementing this as a compressed distance matrix in the same style that pdist does, but as a 1xN CSR matrix, and then using coordinate math to reindex it when you need to get specific values.
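    The "coordinate math" mentioned above refers to the standard condensed layout used by `scipy.spatial.distance.pdist`: for n points, the distances for pairs (i, j) with i < j are stored in a flat array of length n*(n-1)//2. A sketch of the index arithmetic (`condensed_index` is a hypothetical helper name, not a scipy function), verified against `pdist`/`squareform`:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def condensed_index(n, i, j):
    """Map square coordinates (i, j), i != j, to the flat index used by
    scipy.spatial.distance.pdist's condensed distance array."""
    i, j = np.minimum(i, j), np.maximum(i, j)  # enforce i < j
    return n * i - i * (i + 1) // 2 + (j - i - 1)

# Verify against pdist/squareform on a small random example
rng = np.random.default_rng(0)
points = rng.random((6, 3))
d = pdist(points)      # condensed form, length n*(n-1)//2
D = squareform(d)      # square form
n = D.shape[0]
assert d[condensed_index(n, 1, 4)] == D[1, 4]
```

    Because `condensed_index` is vectorized, it can reindex whole arrays of (i, j) pairs at once; the self-distances (the diagonal) are not stored and would need to be handled separately.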