Tags: julia, sparse-matrix, cupy

Fast tensor-dot on sparse arrays with GPU in any programming language?


I'm working with two multi-dimensional arrays, arr and cost. arr is dense with shape (width, height, m, n) and cost is sparse with shape (width, height, width, height).

Typical values: width and height are around 5000, m is less than 100, and n is less than 10. The sparse cost is strictly block-sparse: for each cost[i,j,:,:], only a single k*k block is nonzero, where k is less than 50.

Now I want this:

result = np.tensordot(cost, arr, axes=[[2,3],[0,1]])

The shape of result is then (width, height, m, n) (the same as arr). However, the cost array is far too large to store as a dense array.
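
For small sizes everything fits in memory, so the operation (and the reshape identity that the solution below relies on) can be checked with dense NumPy arrays. This is only a sketch; the sizes here are made up for illustration:

import numpy as np

# Tiny, made-up sizes just to check shapes and the reshape identity.
width, height, m, n = 6, 5, 4, 3
cost = np.random.random((width, height, width, height))
arr = np.random.random((width, height, m, n))

# Contract the last two axes of cost with the first two axes of arr.
result = np.tensordot(cost, arr, axes=[[2, 3], [0, 1]])
assert result.shape == (width, height, m, n)

# The same contraction as an ordinary 2D matrix product after reshaping,
# which is what makes a sparse-matrix formulation possible.
cost2d = cost.reshape(width*height, width*height)
arr2d = arr.reshape(width*height, m*n)
result2d = cost2d @ arr2d
assert np.allclose(result, result2d.reshape(width, height, m, n))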

So, the question is: how to do a fast tensor-dot (preferably on a GPU) on sparse arrays?

Solutions or ideas in any programming language are welcome, including but not limited to Python, C++, and Julia.


I've tried:

CPU versions (multithreaded): C++ (simply with std::vector), numpy/scipy, Julia

GPU versions: cupy, CUDA.jl

The C++ version works well with some simple for-loops on std::vectors, but it is not fast enough.

The cupy and CUDA.jl versions both run very fast on small tests where width and height are set to small values and both cost and arr are defined as dense arrays, but I don't know how to modify them to handle a sparse cost.
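
The dense small-scale GPU test looked roughly like the sketch below (illustrative only; the sizes and variable names are made up, and the CUDA.jl version was analogous):

import cupy as cp

# Small, made-up sizes: at this scale the dense cost fits in GPU memory.
width, height, m, n = 32, 32, 8, 4
cost_gpu = cp.random.random((width, height, width, height))
arr_gpu = cp.random.random((width, height, m, n))

# Fast on small sizes, but cost cannot be materialized densely at full scale.
result_gpu = cp.tensordot(cost_gpu, arr_gpu, axes=[[2, 3], [0, 1]])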


Solution

  • I finally solved this with cupy and some reshaping.

    I had already implemented the tensor-dot with reshaping in the Julia version, but I didn't realize that the reshaping was the key to turning this into a 2D sparse-matrix problem. Thanks to @hpaulj for reminding me of this.

    The minimal example is (nvar and nvalue here correspond to m and n above):

    import cupy as cp
    import numpy as np
    from cupyx.scipy import sparse as sparse_gpu
    from scipy import sparse as sparse_cpu
    
    # width, height, nvar, nvalue, _lambda, DTYPE_NP and DTYPE_CP are assumed
    # to be defined earlier. get_cost_cpu (problem-specific, not shown) returns
    # the block-sparse cost as a scipy.sparse CSR matrix of shape
    # (width*height, width*height) on the CPU:
    # sparse_cost2d_cpu = get_cost_cpu(...)
    # Move the CSR components to the GPU and rebuild the sparse matrix there.
    data = cp.asarray(sparse_cost2d_cpu.data, dtype=DTYPE_CP)
    indices = cp.asarray(sparse_cost2d_cpu.indices)
    indptr = cp.asarray(sparse_cost2d_cpu.indptr)
    sparse_cost2d_gpu = sparse_gpu.csr_matrix(
        (data, indices, indptr), shape=(width*height, width*height), dtype=DTYPE_CP)
    
    # Dense 4D array: reshape it to 2D so the contraction becomes a plain
    # sparse-dense matrix product.
    arr4d_cpu = np.random.random((width, height, nvar, nvalue)).astype(DTYPE_NP)
    arr2d_cpu = arr4d_cpu.reshape((width*height, nvar*nvalue))
    arr2d_gpu = cp.asarray(arr2d_cpu)
    
    for i in range(100):
        # The tensordot over the last two axes of cost and the first two axes
        # of arr is now a single CSR * dense matrix product on the GPU.
        new_arr2d_gpu = sparse_cost2d_gpu.dot(arr2d_gpu)
        arr2d_gpu = new_arr2d_gpu*_lambda + arr2d_gpu*(1 - _lambda)
        arr4d_gpu = arr2d_gpu.reshape((width, height, nvar, nvalue))
        # Some other computation with arr4d_gpu
        arr2d_gpu = arr4d_gpu.reshape((width*height, nvar*nvalue))
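
    get_cost_cpu above is problem-specific and not shown. As a rough, hypothetical sketch only (block_of and its return values are placeholders, not part of the original code), the block-sparse 2D cost could be assembled on the CPU along these lines before being moved to the GPU:

    import numpy as np
    from scipy import sparse as sparse_cpu

    def build_cost2d(width, height, k, block_of, dtype=np.float32):
        # Illustrative only: row (i*height + j) of the
        # (width*height, width*height) matrix holds the single k*k nonzero
        # block of cost[i, j, :, :]. block_of(i, j) is a hypothetical callback
        # returning (i0, j0, block): the top-left corner of the nonzero region
        # and its k*k values.
        rows, cols, vals = [], [], []
        for i in range(width):
            for j in range(height):
                i0, j0, block = block_of(i, j)
                row = i*height + j
                for di in range(k):
                    for dj in range(k):
                        rows.append(row)
                        cols.append((i0 + di)*height + (j0 + dj))
                        vals.append(block[di, dj])
        coo = sparse_cpu.coo_matrix((vals, (rows, cols)),
                                    shape=(width*height, width*height),
                                    dtype=dtype)
        return coo.tocsr()

    Each of the width*height rows then stores only k*k nonzeros, so the CSR matrix holds roughly width*height*k*k values instead of (width*height)**2, and sparse_cost2d_gpu.dot performs the whole contraction as one sparse-dense matrix product per iteration. At full scale the Python loops above would be too slow, so a vectorized construction of the index arrays with NumPy would be needed.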