What NumPy workload would yield the highest speedup when MKL or OpenBLAS is installed? MMM? QR? SVD? I have tried MMM, but I don't see a speedup; on the contrary, it gets worse. My test code is as follows:
import numpy as np
import time
import gc
sizes = [2000, 20000]
samples = np.zeros(5)
for n in sizes:
    # allocate outside the benchmark
    X = np.random.normal(size=(n, n))
    Y = X
    # warm-up runs so the timed runs hit steady state
    for i in range(2):
        Z = X @ Y
    # run 5 times
    for r in range(5):
        start_time = time.process_time()
        Z = X @ Y
        samples[r] = time.process_time() - start_time
    X = Y = Z = None
    # remove unnecessary memory consumption
    gc.collect()
    print(np.mean(samples))
    print(np.std(samples))
As I re-run this while increasing the number of available threads (the maximum number of cores on this box is 8), I see the response time increasing instead of decreasing.
Not taking any chances, I change all of these environment variables each time (here set to 1, but I also try 2, 4, 8):
export MKL_NUM_THREADS=1
export OMP_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export NUMEXPR_MAX_THREADS=1
export VECLIB_MAXIMUM_THREADS=1
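(Note that these variables are read when the BLAS library is loaded, so exporting them in the shell before launching Python, as above, works; to set them from within Python instead, they must be assigned before NumPy is imported. A minimal sketch:)
import os
# must happen before "import numpy": the BLAS reads these at load time
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
import numpy as np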
You see the response time increasing instead of decreasing because the way you measure time is likely not the correct one. What you probably want is the wall-clock time and not the cumulated parallel CPU time. Indeed, time.process_time() returns the "sum of the kernel and user-space CPU time". You can use time.time() instead to fix this issue.
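Here is a minimal sketch of the difference (the matrix size is an arbitrary choice; time.perf_counter() is used as the wall-clock timer since it is preferred for benchmarks, and time.time() behaves similarly for this purpose). With more threads, the wall-clock time should shrink while the accumulated CPU time stays roughly constant or even grows:
import time
import numpy as np

n = 2000
X = np.random.normal(size=(n, n))

start = time.perf_counter()          # wall-clock timer
Z = X @ X
print("wall-clock:", time.perf_counter() - start)

start = time.process_time()          # sums the CPU time of all threads
Z = X @ X
print("CPU time:  ", time.process_time() - start)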
Note that the accumulated CPU time increases because of the overhead of using multiple threads: N threads do not do the work N times faster, but a bit less than N times (see Amdahl's law).
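As a small worked example of Amdahl's law: if a fraction p of the work is parallelizable, the best possible speedup with N threads is 1 / ((1 - p) + p / N):
def amdahl_speedup(p, n_threads):
    # p: parallel fraction of the work, n_threads: number of threads
    return 1.0 / ((1.0 - p) + p / n_threads)

print(amdahl_speedup(0.9, 8))  # ~4.7x, not 8x, with 90% parallel work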
General information:
By default, NumPy selects a BLAS implementation available on the machine. On most Linux distributions, this is often already OpenBLAS. So explicitly using OpenBLAS will not change the results.
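To check which BLAS/LAPACK implementation your NumPy build actually uses, np.show_config() is built in; the third-party threadpoolctl package (a separate pip install, not part of NumPy) can additionally report the library and thread count loaded at runtime:
import numpy as np
np.show_config()  # lists the BLAS/LAPACK libraries NumPy was built against

# optional: threadpoolctl inspects what is actually loaded at runtime
from threadpoolctl import threadpool_info
for pool in threadpool_info():
    print(pool["internal_api"], pool.get("version"), pool["num_threads"])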
The performance of a linear algebra library's operations changes with the use case and the input type: an MMM on double-precision complex numbers will likely not behave the same as one on single-precision real numbers. Some libraries are better optimized for specific cases and specific input types, not to mention the impact of the target platform, which is very important. The BLIS benchmark page shows this clearly. I will focus on general primitives working on double-precision real numbers running on an x86-64 Intel platform.
MMM is very compute-bound and scales very well. It is very optimized in most decent BLAS implementations. When a BLAS implementation does not optimize it, that generally means it does not optimize the other primitives either. OpenBLAS and the MKL are often close on this primitive on mainstream x86-64 platforms (the MKL often outperforms OpenBLAS a bit).
QR is mostly memory-bound and much harder to optimize. AFAIK, the one provided in OpenBLAS is barely optimized: it is based on the Netlib implementation of LAPACK, which is not very optimized either. The one in the MKL should be significantly better. These results may change in the near future. However, the memory throughput may limit the performance of both implementations, resulting in similar results depending on the target hardware.
SVD is the hardest of the three operations to optimize. It is also the most expensive. SVD is a LAPACK operation, and thus the situation is similar to the QR one. Hence, I expect the MKL to clearly outperform OpenBLAS on this operation.
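To compare the three primitives on your own machine with wall-clock timing, here is a minimal benchmark sketch (the size and the single warm-up/timed repetition are arbitrary choices; add more repeats for stable numbers):
import time
import numpy as np

n = 2000
X = np.random.normal(size=(n, n))  # double-precision real matrix

for name, op in [("MMM", lambda: X @ X),
                 ("QR", lambda: np.linalg.qr(X)),
                 ("SVD", lambda: np.linalg.svd(X))]:
    op()  # warm-up run
    start = time.perf_counter()
    op()
    print(name, time.perf_counter() - start, "s")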