numpy, performance, scipy, linear-algebra, eigenvalue

Scipy banded eigensolver much slower than standard eigensolver


I'm observing a strange behaviour concerning the scipy.linalg.eig_banded eigensolver.

I am generating banded matrices of size N = p*f that have a specific structure. The matrices are symmetric block-tridiagonal, with p blocks of size f×f on the main diagonal and p-1 identity matrices of size f×f on the off-diagonals.

Example with p=3 and f=3:

 [2 2 2 1 0 0 0 0 0]
 [2 2 2 0 1 0 0 0 0]
 [2 2 2 0 0 1 0 0 0]
 [1 0 0 3 3 3 1 0 0]
 [0 1 0 3 3 3 0 1 0]
 [0 0 1 3 3 3 0 0 1]
 [0 0 0 1 0 0 4 4 4]
 [0 0 0 0 1 0 4 4 4]
 [0 0 0 0 0 1 4 4 4]

Usually these matrices have p = 100, f = 30, N = p*f = 3000, but they can easily grow much larger.

Given the structure of these matrices, I was hoping that the banded eigensolver in SciPy would be much faster than the dense eigensolver, but this does not seem to be the case.
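
The helper functions generate_matrix and banded_format used in the benchmark below are not shown in the question; a minimal sketch of what they could look like, assuming the block-tridiagonal structure described above and eig_banded's upper banded storage convention (a_band[u + i - j, j] == a[i, j] for i <= j), is:

import numpy as np

def generate_matrix(p, f, low, high, seed=None):
    # Symmetric block-tridiagonal matrix: p random symmetric f x f blocks on
    # the main diagonal and f x f identity blocks on the first off-diagonals.
    rng = np.random.default_rng(seed)
    N = p * f
    A = np.zeros((N, N))
    I = np.eye(f)
    for i in range(p):
        B = rng.uniform(low, high, size=(f, f))
        A[i*f:(i+1)*f, i*f:(i+1)*f] = (B + B.T) / 2  # symmetrize the block
        if i < p - 1:
            A[i*f:(i+1)*f, (i+1)*f:(i+2)*f] = I
            A[(i+1)*f:(i+2)*f, i*f:(i+1)*f] = I
    return A

def banded_format(A, upper):
    # Pack the upper triangle of symmetric A into the (upper+1, N) array
    # expected by scipy.linalg.eig_banded with lower=False.
    N = A.shape[0]
    ab = np.zeros((upper + 1, N))
    for j in range(N):
        for i in range(max(0, j - upper), j + 1):
            ab[upper + i - j, j] = A[i, j]
    return ab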

I am benchmarking the solvers with the following code:

from time import time

import numpy as np
from scipy import linalg

# Set dimension of problem
f = 50
p = 80
a = 1

print(f"p={p}, f={f}, size={f*p, f*p}")

print(f"Matrix containing random numbers in {(-a, a)}")
A = generate_matrix(p, f, -a, a)

# Benchmark standard eigensolver
start = time()
D, Q = linalg.eigh(A)
end = time()

# Test correctness
D = np.diag(D)
print(f"Time for dense solver {end - start}")
print(f"||AQ - QD|| = {np.linalg.norm(A@Q - Q@D)}")


# Convert A to banded format
A_banded = banded_format(A, upper = f)

# Benchmark banded eigensolver
start = time()
D, Q = linalg.eig_banded(A_banded)
end = time()

# Test correctness
D = np.diag(D)
print(f"Time for banded solver {end - start}")
print(f"||AQ - QD|| = {np.linalg.norm(A@Q - Q@D)}")

The results I get indicate that the banded eigensolver is significantly slower than the dense one (roughly 2x):

p=80, f=50, size=(4000, 4000)
Matrix containing random numbers in (-1, 1)

Time for dense solver 13.475645780563354
||AQ - QD|| = 3.1334336527852233e-12

Time for banded solver 24.427151203155518
||AQ - QD|| = 1.589349711533356e-11

I have already tried storing the matrix in the lower band format and passing the overwrite_a_band=True option, but the performance remains the same.
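
For reference, this is roughly the lower-band variant I mean (a sketch; it assumes a hypothetical helper mirroring banded_format but using eig_banded's lower form, a_band[i - j, j] == a[i, j] for i >= j):

def banded_format_lower(A, lower):
    # Hypothetical counterpart of banded_format for eig_banded's lower form.
    N = A.shape[0]
    ab = np.zeros((lower + 1, N))
    for j in range(N):
        for i in range(j, min(N, j + lower + 1)):
            ab[i - j, j] = A[i, j]
    return ab

A_banded_lower = banded_format_lower(A, f)
D, Q = linalg.eig_banded(A_banded_lower, lower=True, overwrite_a_band=True)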

NumPy configuration:

blas_mkl_info:
  NOT AVAILABLE
blis_info:
  NOT AVAILABLE
openblas_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/cluster/apps/gcc-8.2.0/openblas-0.2.20-5gatj7a35vypgjekzf3ibbtz54tlbk3m/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    runtime_library_dirs = ['/cluster/apps/gcc-8.2.0/openblas-0.2.20-5gatj7a35vypgjekzf3ibbtz54tlbk3m/lib']
blas_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/cluster/apps/gcc-8.2.0/openblas-0.2.20-5gatj7a35vypgjekzf3ibbtz54tlbk3m/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    runtime_library_dirs = ['/cluster/apps/gcc-8.2.0/openblas-0.2.20-5gatj7a35vypgjekzf3ibbtz54tlbk3m/lib']
lapack_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/cluster/apps/gcc-8.2.0/openblas-0.2.20-5gatj7a35vypgjekzf3ibbtz54tlbk3m/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    runtime_library_dirs = ['/cluster/apps/gcc-8.2.0/openblas-0.2.20-5gatj7a35vypgjekzf3ibbtz54tlbk3m/lib']
lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/cluster/apps/gcc-8.2.0/openblas-0.2.20-5gatj7a35vypgjekzf3ibbtz54tlbk3m/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    runtime_library_dirs = ['/cluster/apps/gcc-8.2.0/openblas-0.2.20-5gatj7a35vypgjekzf3ibbtz54tlbk3m/lib']

SciPy configuration:

lapack_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/cluster/apps/gcc-8.2.0/openblas-0.2.20-5gatj7a35vypgjekzf3ibbtz54tlbk3m/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    runtime_library_dirs = ['/cluster/apps/gcc-8.2.0/openblas-0.2.20-5gatj7a35vypgjekzf3ibbtz54tlbk3m/lib']
lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/cluster/apps/gcc-8.2.0/openblas-0.2.20-5gatj7a35vypgjekzf3ibbtz54tlbk3m/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    runtime_library_dirs = ['/cluster/apps/gcc-8.2.0/openblas-0.2.20-5gatj7a35vypgjekzf3ibbtz54tlbk3m/lib']
blas_mkl_info:
  NOT AVAILABLE
blis_info:
  NOT AVAILABLE
openblas_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/cluster/apps/gcc-8.2.0/openblas-0.2.20-5gatj7a35vypgjekzf3ibbtz54tlbk3m/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    runtime_library_dirs = ['/cluster/apps/gcc-8.2.0/openblas-0.2.20-5gatj7a35vypgjekzf3ibbtz54tlbk3m/lib']
blas_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/cluster/apps/gcc-8.2.0/openblas-0.2.20-5gatj7a35vypgjekzf3ibbtz54tlbk3m/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    runtime_library_dirs = ['/cluster/apps/gcc-8.2.0/openblas-0.2.20-5gatj7a35vypgjekzf3ibbtz54tlbk3m/lib']

I also tried running the same benchmark on a different cluster that uses MKL as the backend instead of OpenBLAS, and I observed very similar results. Setting the number of threads with OMP_NUM_THREADS and/or MKL_NUM_THREADS also has only a small effect on performance.
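
One way to pin the thread count from within the script (the value 8 is only an example, not from the question) looks like this:

import os

# Example values only; these variables typically need to be set before
# NumPy/SciPy first load the BLAS/LAPACK backend for them to take effect.
os.environ["OMP_NUM_THREADS"] = "8"
os.environ["MKL_NUM_THREADS"] = "8"

import numpy as np
from scipy import linalg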

Does anyone have any ideas on why this is happening?

Thanks


Solution

  • I did some digging into the SciPy source code and the Intel MKL documentation, and I have figured out why this is happening.

    The SciPy eig_banded solver delegates the problem to the LAPACK dsbevd routine, which computes all eigenvalues and eigenvectors of a matrix in banded storage using a variant of Cuppen's divide-and-conquer algorithm. The banded storage format offers an advantage in terms of memory usage, but the algorithm itself still scales as O(n^3) flops in the matrix size after tridiagonalization.

    The SciPy dense eigensolver, on the other hand, delegates the problem for real symmetric matrices to the LAPACK dsyevr routine, which computes the eigenvalues and eigenvectors using the MRRR algorithm in O(n^2) flops after tridiagonalization; the driver comparison sketched below makes this difference easy to check.

    I am still unsure why there is no MRRR implementation for the banded matrix format in MKL.
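
    To see the effect of the driver choice (divide-and-conquer vs. MRRR) on the dense path directly, one can force scipy.linalg.eigh to use a specific LAPACK driver. This is a sketch only; it assumes SciPy >= 1.5 (which added the driver keyword) and reuses the dense matrix A from the benchmark above:

    from time import time
    from scipy import linalg

    # "evr" = MRRR (the default for standard problems), "evd" = divide and conquer
    for driver in ("evr", "evd"):
        start = time()
        D, Q = linalg.eigh(A, driver=driver)
        print(f"driver={driver}: {time() - start:.2f} s")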