UPDATE
Unfortunately, due to my oversight, I had an older version of MKL (11.1) linked against numpy. Newer version of MKL (11.3.1) gives same performance in C and when called from python.
What was obscuring things, was even if linking the compiled shared libraries explicitly with the newer MKL, and pointing through LD_* variables to them, and then in python doing import numpy, was somehow making python call old MKL libraries. Only by replacing in python lib folder all libmkl_*.so with newer MKL I was able to match performance in python and C calls.
Background / library info.
Matrix multiplication was done via sgemm (single-precision) and dgemm (double-precision) Intel's MKL library calls, via numpy.dot function. The actual call of the library functions can be verified with e.g. oprof.
Using here 2x18 core CPU E5-2699 v3, hence a total of 36 physical cores. KMP_AFFINITY=scatter. Running on linux.
TL;DR
1) Why is numpy.dot, even though it is calling the same MKL library functions, twice slower at best compared to C compiled code?
2) Why via numpy.dot you get performance decreasing with increasing number of cores, whereas the same effect is not observed in C code (calling the same library functions).
The problem
I've observed that doing matrix multiplication of single/double precision floats in numpy.dot, as well as calling cblas_sgemm/dgemm directly from a compiled C shared library give noticeably worse performance compared to calling same MKL cblas_sgemm/dgemm functions from inside pure C code.
import numpy as np
import mkl
n = 10000
A = np.random.randn(n,n).astype('float32')
B = np.random.randn(n,n).astype('float32')
C = np.zeros((n,n)).astype('float32')
mkl.set_num_threads(3); %time np.dot(A, B, out=C)
11.5 seconds
mkl.set_num_threads(6); %time np.dot(A, B, out=C)
6 seconds
mkl.set_num_threads(12); %time np.dot(A, B, out=C)
3 seconds
mkl.set_num_threads(18); %time np.dot(A, B, out=C)
2.4 seconds
mkl.set_num_threads(24); %time np.dot(A, B, out=C)
3.6 seconds
mkl.set_num_threads(30); %time np.dot(A, B, out=C)
5 seconds
mkl.set_num_threads(36); %time np.dot(A, B, out=C)
5.5 seconds
Doing exactly the same as above, but with double precision A, B and C, you get: 3 cores: 20s, 6 cores: 10s, 12 cores: 5s, 18 cores: 4.3s, 24 cores: 3s, 30 cores: 2.8s, 36 cores: 2.8s.
The topping up of speed for single precision floating points seem to be associated with cache misses. For 28 core run, here is the output of perf. For single precision:
perf stat -e task-clock,cycles,instructions,cache-references,cache-misses ./ptestf.py
631,301,854 cache-misses # 31.478 % of all cache refs
And double precision:
93,087,703 cache-misses # 5.164 % of all cache refs
C shared library, compiled with
/opt/intel/bin/icc -o comp_sgemm_mkl.so -openmp -mkl sgem_lib.c -lm -lirc -O3 -fPIC -shared -std=c99 -vec-report1 -xhost -I/opt/intel/composer/mkl/include
#include <stdio.h>
#include <stdlib.h>
#include "mkl.h"
void comp_sgemm_mkl(int m, int n, int k, float *A, float *B, float *C);
void comp_sgemm_mkl(int m, int n, int k, float *A, float *B, float *C)
{
int i, j;
float alpha, beta;
alpha = 1.0; beta = 0.0;
cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
m, n, k, alpha, A, k, B, n, beta, C, n);
}
Python wrapper function, calling the above compiled library:
def comp_sgemm_mkl(A, B, out=None):
lib = CDLL(omplib)
lib.cblas_sgemm_mkl.argtypes = [c_int, c_int, c_int,
np.ctypeslib.ndpointer(dtype=np.float32, ndim=2),
np.ctypeslib.ndpointer(dtype=np.float32, ndim=2),
np.ctypeslib.ndpointer(dtype=np.float32, ndim=2)]
lib.comp_sgemm_mkl.restype = c_void_p
m = A.shape[0]
n = B.shape[0]
k = B.shape[1]
if np.isfortran(A):
raise ValueError('Fortran array')
if m != n:
raise ValueError('Wrong matrix dimensions')
if out is None:
out = np.empty((m,k), np.float32)
lib.comp_sgemm_mkl(m, n, k, A, B, out)
However, explicit calls from a C-compiled binary calling MKL's cblas_sgemm / cblas_dgemm, with arrays allocated through malloc in C, gives almost 2x better performance compared to the python code, i.e. the numpy.dot call. Also, the effect of performance degradation with increasing number of cores is NOT observed. The best performance was 900 ms for single-precision matrix multiplication and was achieved when using all 36 physical cores via mkl_set_num_cores and running the C code with numactl --interleave=all.
Perhaps any fancy tools or advice for profiling/inspecting/understanding this situation further? Any reading material is much appreciated as well.
UPDATE Following @Hristo Iliev advice, running numactl --interleave=all ./ipython did not change the timings (within noise), but improves the pure C binary runtimes.
I suspect this is due to unfortunate thread scheduling. I was able to reproduce an effect similar to yours. Python was running at ~2.2 s, while the C version was showing huge variations from 1.4-2.2 s.
Applying:
KMP_AFFINITY=scatter,granularity=thread
This ensures that the 28 threads are always running on the same processor thread.
Reduces both runtimes to more stable ~1.24 s for C and ~1.26 s for python.
This is on a 28 core dual socket Xeon E5-2680 v3 system.
Interestingly, on a very similar 24 core dual socket Haswell system, both python and C perform almost identical even without thread affinity / pinning.
Why does python affect the scheduling? Well I assume there is more runtime environment around it. Bottom line is, without pinning your performance results will be non-deterministic.
Also you need to consider, that the Intel OpenMP runtime spawns an extra management thread that can confuse the scheduler. There are more choices for pinning, for instance KMP_AFFINITY=compact
- but for some reason that is totally messed up on my system. You can add ,verbose
to the variable to see how the runtime is pinning your threads.
likwid-pin is a useful alternative providing more convenient control.
In general single precision should be at least as fast as double precision. Double precision can be slower because:
I would think that once you get rid of the performance anomaly, this will be reflected in your numbers.
When you scale up the number of threads for MKL/*gemm, consider
I don't think there is a really simple way to measure how your application is affected by bad scheduling. You can expose this with perf trace -e sched:sched_switch
and there is some software to visualize this, but this will come with a high learning curve. And then again - for parallel performance analysis you should have the threads pinned anyway.