Tags: python, pytorch, cuda, julia, svd

Why does using the GPU cause a slowdown when performing svd on a double-precision array?


I see the same behavior in both Julia and Python: singular value decomposition is slower on the GPU than on the CPU for Float64 arrays. (Float32 arrays behave as one would expect, with the GPU being faster.)

Python benchmark code using PyTorch:

import time
import torch
from torch.linalg import svd

f64 = torch.double
f32 = torch.float

cpu = torch.device("cpu")
gpu = torch.device("cuda")


# X = torch.rand(5_000, 10_000, dtype=f32)
X = torch.rand(5_000, 10_000, dtype=f64)

X_cpu = X.to(cpu)
X_gpu = X.to(gpu)

print(X_cpu.type())
# Warmup
U, Sig, Vt = svd(X_cpu, full_matrices = True)
# Timed run (CPU)
t1 = time.perf_counter()
U, Sig, Vt = svd(X_cpu, full_matrices = True)
t2 = time.perf_counter()
print(U.type())


print(X_gpu.type())
# Warmup
U, Sig, Vt = svd(X_gpu, full_matrices = True)
# Timed run (GPU); synchronize so all queued CUDA work has finished
# before each clock reading
torch.cuda.synchronize()
t3 = time.perf_counter()
U, Sig, Vt = svd(X_gpu, full_matrices = True)
torch.cuda.synchronize()
t4 = time.perf_counter()
print(U.type())

print(f"Time CPU (s): {t2-t1}")
print(f"Time GPU (s): {t4-t3}")

For the above Float64 array I get:

Time CPU (s): 14.52491476599971
Time GPU (s): 56.79755901500175

If I use a Float32 array instead I get a much more reasonable-looking result:

Time CPU (s): 9.301500292000128
Time GPU (s): 6.969021153003268

That said, it's still a little surprising that using the GPU only speeds things up by a couple of seconds.

The Julia code gives similar results:

using LinearAlgebra
using Flux
using CUDA
using cuDNN

X = rand(5_000, 10_000)
println("typeof(X): $(typeof(X))") 
# Warmup
U, Sig, V = LinearAlgebra.svd(X)
# Timed run
t1 = time_ns()
U, Sig, V = LinearAlgebra.svd(X)
t2 = time_ns()

println("typeof(U): $(typeof(U))") 

X_gpu = X |> gpu |> f64
println("typeof(X_gpu): $(typeof(X_gpu))") 

# Warmup (svd! overwrites its input, so warm up on a copy)
U, Sig, V = CUDA.svd!(copy(X_gpu))
CUDA.synchronize()  # let the warmup finish before timing
# Timed run
t3 = time_ns()
U, Sig, V = CUDA.svd!(X_gpu)
CUDA.synchronize()  # make sure the GPU has finished before stopping the clock
t4 = time_ns()
println("typeof(U): $(typeof(U))") 


println("Time CPU (s): $((t2-t1)/1e9)")
println("Time GPU (s): $((t4-t3)/1e9)")

For this Float64 array the GPU again takes much longer than the CPU:

Time CPU (s): 28.641290506
Time GPU (s): 57.069009417

However, switching to Float32 again yields a reasonable result:

Time CPU (s): 15.096364932
Time GPU (s): 7.283513658

Two questions:

  1. Why do Float64 arrays perform so poorly on the GPU? I am using an NVIDIA 40-series GPU, if that is relevant.
  2. Is there any way of improving the performance of svd run on the GPU? (Especially for Float64 arrays, but speeding up svd for Float32 arrays would also be nice.) One possibility would be to change how the SVD is computed. I checked, and there don't seem to be any optional arguments available for CUDA.jl's svd function. I tried setting full_matrices=False for PyTorch's svd function, but I got the same results.

Solution

  • All consumer-level NVIDIA GPUs have very limited support for 64-bit floating-point operations.
    Where an SM on a consumer chip can issue 128 float32 operations per cycle, it can issue only 2 float64 operations (see the table below).
    In my experiments I typically find that float64 code runs about 8x slower (more if the floating-point operations are heavily optimized). The slowdown usually falls well short of the raw 64:1 throughput ratio because latency, memory fetch times, etc. account for much of the runtime.
    You really need to avoid double. If you really need the extra precision (which is rare), look at Kahan summation; a small sketch is included at the end of this answer.

    NVIDIA documents this in the CUDA programming guide:

    Throughput per clock cycle per SM, by compute capability:

    Compute capability:     5.0,5.2  5.3  6.0  6.1  6.2  7.x  8.0  8.6  8.9  9.0
    16-bit floating-point:  N/A      256  128  2    256  128  256  256  128  256
    32-bit floating-point:  128      128  64   128  128  64   64   128  128  128
    64-bit floating-point:  4        4    32   4    4    32   32   2    2    64

    Note how on all non-server hardware the double-precision throughput is severely limited: on compute capability 8.9, for example, the ratio is 128:2, i.e. 64x fewer float64 than float32 operations per cycle. Capping 64-bit throughput limits the chip real estate spent on rarely used double-precision operations, which lets NVIDIA make consumer GPUs on smaller, cheaper dies (or spend the silicon on 32-bit performance instead).

    The compute capability numbers refer to the microarchitecture: 8.9 is Ada Lovelace (the RTX 4000 series), 8.6 is Ampere (the RTX 3000 series), and so on; see https://en.wikipedia.org/wiki/CUDA

    As for better performance, have a look at different libraries, especially Python wrappers around NVIDIA's native CUDA libraries (such as cuSOLVER) that are tuned for the GPU. Two sketches follow below.
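
A minimal sketch of the Kahan (compensated) summation idea mentioned above, in plain Python. It only illustrates how accuracy can be recovered in long reductions without relying on float64 hardware; it is not a drop-in fix for an SVD, and the function name kahan_sum is just a placeholder.

def kahan_sum(values):
    """Compensated (Kahan) summation.

    Carries a running compensation term for the low-order bits that a
    naive `total += v` loop would discard, so long sums stay accurate
    even in lower precision.
    """
    total = 0.0
    compensation = 0.0  # rounding error accumulated so far
    for v in values:
        y = v - compensation            # re-inject the error from the previous step
        t = total + y                   # low-order bits of y may be lost here...
        compensation = (t - total) - y  # ...so capture them for the next iteration
        total = t
    return total

# Example usage: compare against a naive accumulation loop.
print(kahan_sum(0.1 for _ in range(1_000_000)))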
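
On the PyTorch side specifically, two things are worth benchmarking. This is only a sketch, assuming a reasonably recent PyTorch build: torch.linalg.svd accepts a CUDA-only driver keyword that selects the underlying cuSOLVER routine, and the float64 penalty can be avoided altogether by factorising in float32 and casting the factors back up.

import torch

X_gpu = torch.rand(5_000, 10_000, dtype=torch.double, device="cuda")

# Option 1: select a different cuSOLVER routine via the CUDA-only
# `driver` argument ("gesvd", "gesvdj" or "gesvda"). None of them is
# universally fastest, so time each on your own matrix and GPU.
U, S, Vh = torch.linalg.svd(X_gpu, full_matrices=False, driver="gesvdj")

# Option 2: sidestep the weak float64 hardware by computing in float32
# and casting the factors back up afterwards (the result is, of course,
# only accurate to single precision).
U32, S32, Vh32 = torch.linalg.svd(X_gpu.float(), full_matrices=False)
U64, S64, Vh64 = U32.double(), S32.double(), Vh32.double()

Which of these helps, and by how much, depends on the matrix shape and the GPU, so time both against the CPU run on your own data.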