Apply cupy.linalg.svd over a stack of matrices

I'd like to perform SVD (singular value decomposition) via cupy.linalg.svd over a stack of matrices, computing the SVD per-matrix.

import cupy as cp
arr = cp.random.uniform(size=(1000, 3, 3), dtype=cp.float32)
sing_vals = cp.linalg.svd(arr, compute_uv=False, full_matrices=False)

This gives an error claiming that arr is not 2D. Apparently, cp.linalg.svd can only compute the SVD of a single 2D matrix. However, numpy.linalg.svd computes the SVD always over the last two axes of an array which is much more powerful.

Is there a way to efficiently compute the SVD over a stack of matrices in cupy?

Better yet, is there a general method to efficiently apply_along_axis?

Solution

The current implementation of CuPy calls cusolverDn<t>gesvd(), which does not support batch computation. For an efficient batch computation, I suppose CuPy has to invoke a CUDA API that receives batched inputs.

FYI to improve CuPy, cuSOLVER has cusolverDn<t>gesvdjBatched() and cusolverDn<t>gesvdaStridedBatched(), which seem possible to be used for batched SVD (of dense general matrices). I don't have an idea on the difference among SVD algorithms. j stands for Jacobi method and a stands for approximate.