I have a fairly simple Cython module - which I simplified even further for the specific tests I have carried on.
This module has only one class, which has only one interesting method (named run
): this method accepts as an input a Fortran-ordered 2D NumPy array and two 1D NumPy arrays, and does some very, very simple things on those (see below for code).
For the sake of benchmarking, I have compiled the exact same module with MSVC, GCC 8, GCC 11, GCC 12 and GCC 13 on Windows 10 64 bit, Python 3.9.10 64 bit, Cython 0.29.32. All the GCC compilers I have obtained from the excellent Brecht Sanders GitHub page (https://github.com/brechtsanders).
The overarching question of this very long post is: I am just curious to know if anyone has any explanation regarding why GCC12 and GCC13 are so much slower than GCC11 (which is the fastest of all). Looks like performances are going down at each release of GCC, rather than getting better...
In the benchmarking, I simply vary the array dimensions of the 2D and 1D arrays (m
and n
) and the number on nonzero entries in the 2D and 1D arrays. I repeat the run
method 20 times per compiler version, per set of m
and n
and nonzero entries.
Optimization settings I am using:
MVSC
MSVC_EXTRA_COMPILE_ARGS = ['/O2', '/GS-', '/fp:fast', '/Ob2', '/nologo', '/arch:AVX512', '/Ot', '/GL']
GCC
GCC_EXTRA_COMPILE_ARGS = ['-Ofast', '-funroll-loops', '-flto', '-ftree-vectorize', '-march=native', '-fno-asynchronous-unwind-tables']
GCC_EXTRA_LINK_ARGS = ['-flto'] + GCC_EXTRA_COMPILE_ARGS
What I am observing is the following:
Table of Results:
Runtimes are in milliseconds (ms)
Graph (NOTE: Logarithmic Y axis!!):
Runtimes are in milliseconds (ms)
Cython file:
###############################################################################
import numpy as np
cimport numpy as np
import cython
from cython.view cimport array as cvarray
from libc.float cimport FLT_MAX
DTYPE_float = np.float32
ctypedef np.float32_t DTYPE_float_t
cdef float SMALL = 1e-10
cdef int MAXSIZE = 1000000
cdef extern from "math.h" nogil:
cdef float fabsf(float x)
###############################################################################
@cython.final
cdef class CythonLoops:
cdef int m, n
cdef int [:] col_starts
cdef int [:] row_indices
cdef double [:] x
def __cinit__(self):
self.m = 0
self.n = 0
self.col_starts = cvarray(shape=(MAXSIZE,), itemsize=sizeof(int), format='i')
self.row_indices = cvarray(shape=(MAXSIZE,), itemsize=sizeof(int), format='i')
self.x = cvarray(shape=(MAXSIZE,), itemsize=sizeof(double), format='d')
@cython.boundscheck(False) # turn off bounds-checking for entire function
@cython.wraparound(False) # turn off negative index wrapping for entire function
@cython.nonecheck(False)
@cython.initializedcheck(False)
@cython.cdivision(True)
cpdef run(self, DTYPE_float_t[::1, :] matrix,
DTYPE_float_t[:] ub_values,
DTYPE_float_t[:] priority):
cdef Py_ssize_t i, j, m, n
cdef int nza, collen
cdef double too_large, ok, obj
cdef float ub, element
cdef int [:] col_starts = self.col_starts
cdef int [:] row_indices = self.row_indices
cdef double [:] x = self.x
m = matrix.shape[0]
n = matrix.shape[1]
self.m = m
self.n = n
nza = 0
collen = 0
for i in range(n):
for j in range(m+1):
if j == 0:
element = priority[i]
else:
element = matrix[j-1, i]
if fabsf(element) < SMALL:
continue
if j == 0:
obj = <double>element
# Do action 1 with external library
else:
collen = nza + 1
col_starts[collen] = i+1
row_indices[collen] = j
x[collen] = <double>element
nza += 1
ub = ub_values[i]
if ub > FLT_MAX:
too_large = 0.0
# Do action 2 with external library
elif ub > SMALL:
ok = <double>ub
# Do action 3 with external library
# Use x, row_indices and col_starts in the external library
Setup file:
I use the following to compile it:
python setup.py build_ext --inplace --compiler=mingw32 gcc13
Where the last argument is the compiler I want to test
#!/usr/bin/env python
from setuptools import setup
from setuptools import Extension
from Cython.Build import cythonize
from Cython.Distutils import build_ext
import numpy as np
import os
import shutil
import sys
import getpass
MODULE = 'loop_cython_%s'
GCC_EXTRA_COMPILE_ARGS = ['-Ofast', '-funroll-loops', '-flto', '-ftree-vectorize', '-march=native', '-fno-asynchronous-unwind-tables']
GCC_EXTRA_LINK_ARGS = ['-flto'] + GCC_EXTRA_COMPILE_ARGS
MSVC_EXTRA_COMPILE_ARGS = ['/O2', '/GS-', '/fp:fast', '/Ob2', '/nologo', '/arch:AVX512', '/Ot', '/GL']
MSVC_EXTRA_LINK_ARGS = MSVC_EXTRA_COMPILE_ARGS
def remove_builds(kind):
for folder in ['build', 'bin']:
if os.path.isdir(folder):
if folder == 'bin':
continue
shutil.rmtree(folder, ignore_errors=True)
if os.path.isfile(MODULE + '_%s.c'%kind):
os.remove(MODULE + '_%s.c'%kind)
def setenv(extra_args, doset=True, path=None, kind='gcc8'):
flags = ''
if doset:
flags = ' '.join(extra_args)
for key in ['CFLAGS', 'FFLAGS', 'CPPFLAGS']:
os.environ[key] = flags
user = getpass.getuser()
if doset:
path = os.environ['PATH']
if kind == 'gcc8':
os.environ['PATH'] = r'C:\Users\%s\Tools\MinGW64_8.0\bin;C:\WINDOWS\system32;C:\WINDOWS;C:\Users\%s\WinPython39\WPy64-39100\python-3.9.10.amd64;'%(user, user)
elif kind == 'gcc11':
os.environ['PATH'] = r'C:\Users\%s\Tools\MinGW64\bin;C:\WINDOWS\system32;C:\WINDOWS;C:\Users\%s\WinPython39\WPy64-39100\python-3.9.10.amd64;'%(user, user)
elif kind == 'gcc12':
os.environ['PATH'] = r'C:\Users\%s\Tools\MinGW64_12.2.0\bin;C:\WINDOWS\system32;C:\WINDOWS;C:\Users\%s\WinPython39\WPy64-39100\python-3.9.10.amd64;'%(user, user)
elif kind == 'gcc13':
os.environ['PATH'] = r'C:\Users\%s\Tools\MinGW64_13.2.0\bin;C:\WINDOWS\system32;C:\WINDOWS;C:\Users\%s\WinPython39\WPy64-39100\python-3.9.10.amd64;'%(user, user)
elif kind == 'msvc':
os.environ['PATH'] = r'C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.35.32215\bin\Hostx64\x64;C:\WINDOWS\system32;C:\WINDOWS;C:\Users\J0514162\WinPython39\WPy64-39100\python-3.9.10.amd64;C:\Program Files (x86)\Windows Kits\10\bin\10.0.22000.0\x64'
os.environ['LIB'] = r'C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.35.32215\lib\x64;C:\Program Files (x86)\Windows Kits\10\Lib\10.0.22000.0\um\x64;C:\Program Files (x86)\Windows Kits\10\Lib\10.0.22000.0\ucrt\x64'
os.environ["DISTUTILS_USE_SDK"] = '1'
os.environ["MSSdk"] = '1'
else:
os.environ['PATH'] = path
return path
class CustomBuildExt(build_ext):
def build_extensions(self):
# Override the compiler executables. Importantly, this
# removes the "default" compiler flags that would
# otherwise get passed on to to the compiler, i.e.,
# distutils.sysconfig.get_var("CFLAGS").
self.compiler.set_executable("compiler_so", "gcc -mdll -O -Wall -DMS_WIN64")
self.compiler.set_executable("compiler_cxx", "g++ -O -Wall -DMS_WIN64")
self.compiler.set_executable("linker_so", "gcc -shared -static")
self.compiler.dll_libraries = []
build_ext.build_extensions(self)
if __name__ == '__main__':
os.system('cls')
kind = None
for arg in sys.argv:
if arg.strip() in ['gcc8', 'gcc11', 'gcc12', 'gcc13', 'msvc']:
kind = arg
sys.argv.remove(arg)
break
base_file = os.path.join(os.getcwd(), MODULE[0:-3])
source = base_file + '.pyx'
target = base_file + '_%s.pyx'%kind
shutil.copyfile(source, target)
if kind == 'msvc':
extra_compile_args = MSVC_EXTRA_COMPILE_ARGS[:]
extra_link_args = MSVC_EXTRA_LINK_ARGS[:] + ['/MANIFEST']
else:
extra_compile_args = GCC_EXTRA_COMPILE_ARGS[:]
extra_link_args = GCC_EXTRA_LINK_ARGS[:]
path = setenv(extra_compile_args, kind=kind)
remove_builds(kind)
define_macros = [('WIN32', 1)]
nname = MODULE%kind
include_dirs = [np.get_include()]
if kind == 'msvc':
include_dirs += [r'C:\Program Files (x86)\Windows Kits\10\Include\10.0.22000.0\ucrt',
r'C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.35.32215\include',
r'C:\Program Files (x86)\Windows Kits\10\Include\10.0.22000.0\shared']
extensions = [
Extension(nname, [nname + '.pyx'],
extra_compile_args=extra_compile_args,
extra_link_args=extra_link_args,
include_dirs=include_dirs,
define_macros=define_macros)]
# build the core extension(s)
setup_kwargs = {'ext_modules': cythonize(extensions,
compiler_directives={'embedsignature' : False,
'boundscheck' : False,
'wraparound' : False,
'initializedcheck': False,
'cdivision' : True,
'language_level' : '3str',
'nonecheck' : False},
force=True,
cache=False,
quiet=False)}
if kind != 'msvc':
setup_kwargs['cmdclass'] = {'build_ext': CustomBuildExt}
setup(**setup_kwargs)
setenv([], False, path)
remove_builds(kind)
Test code:
import os
import numpy
import time
import loop_cython_msvc as msvc
import loop_cython_gcc8 as gcc8
import loop_cython_gcc11 as gcc11
import loop_cython_gcc12 as gcc12
import loop_cython_gcc13 as gcc13
# M N NNZ(matrix) NNZ(priority) NNZ(ub)
DIMENSIONS = [(1661 , 2608 , 3560 , 375 , 2488 ),
(2828 , 3512 , 4333 , 413 , 2973 ),
(780 , 985 , 646 , 23 , 984 ),
(799 , 1558 , 1883 , 301 , 1116 ),
(399 , 540 , 388 , 44 , 517 ),
(10545, 10486, 14799 , 1053 , 10041),
(3369 , 3684 , 3684 , 256 , 3242 ),
(2052 , 5513 , 4772 , 1269 , 3319 ),
(224 , 628 , 1345 , 396 , 594 ),
(553 , 1475 , 1315 , 231 , 705 )]
def RunTest():
print('M N NNZ MSVC GCC 8 GCC 11 GCC 12 GCC 13')
for m, n, nnz_mat, nnz_priority, nnz_ub in DIMENSIONS:
print('%-6d %-6d %-8d'%(m, n, nnz_mat), end='')
for solver, label in zip([msvc, gcc8, gcc11, gcc12, gcc13], ['MSVC', 'GCC 8', 'GCC 11', 'GCC 12', 'GCC 13']):
numpy.random.seed(123456)
size = m*n
idxes = numpy.arange(size)
matrix = numpy.zeros((size, ), dtype=numpy.float32)
idx_mat = numpy.random.choice(idxes, nnz_mat)
matrix[idx_mat] = numpy.random.uniform(0, 1000, size=(nnz_mat, ))
matrix = numpy.asfortranarray(matrix.reshape((m, n)))
idxes = numpy.arange(m)
priority = numpy.zeros((m, ), dtype=numpy.float32)
idx_pri = numpy.random.choice(idxes, nnz_priority)
priority[idx_pri] = numpy.random.uniform(0, 1000, size=(nnz_priority, ))
idxes = numpy.arange(n)
ub_values = numpy.inf*numpy.ones((n, ), dtype=numpy.float32)
idx_ub = numpy.random.choice(idxes, nnz_ub)
ub_values[idx_ub] = numpy.random.uniform(0, 1000, size=(nnz_ub, ))
solver = solver.CythonLoops()
time_tot = []
for i in range(20):
start = time.perf_counter()
solver.run(matrix, ub_values, priority)
elapsed = time.perf_counter() - start
time_tot.append(elapsed*1e3)
print('%-8.4g'%numpy.mean(time_tot), end=' ')
print()
if __name__ == '__main__':
os.system('cls')
RunTest()
After @PeterCordes comments, I have changed the optimization flags to this:
MSVC_EXTRA_COMPILE_ARGS = ['/O2', '/GS-', '/fp:fast', '/Ob2', '/nologo', '/arch:AVX512', '/Ot', '/GL', '/QIntel-jcc-erratum']
GCC_EXTRA_COMPILE_ARGS = ['-Ofast', '-funroll-loops', '-flto', '-ftree-vectorize', '-march=native', '-fno-asynchronous-unwind-tables', '-Wa,-mbranches-within-32B-boundaries']
MSVC appears to be marginally faster than before (between 5% and 10%), but GCC12 and GCC13 are slower than before (between 3% and 20%). Below a graph with the results on the largest 2D matrix:
Note: "Current" means with the latest optimization flags suggested by @PeterCordes, "Previous" is the original set of flags.
This is a partial answer providing the generated C code produced by Cython once it has been simplified a bit to be shorter, more human-readable and easy to compile without any Cython, Python, or NumPy dependencies (the transformations are not expected to drastically impact the timings). It also show the generated assembly code and few comments.
Here is the C code:
#include <stdlib.h>
#include <stdint.h>
#include <math.h>
#include <float.h>
/* Remove external dependencies */
typedef int64_t Py_ssize_t;
typedef void memoryview_obj;
typedef void* VTable;
typedef struct {
struct memoryview_obj *memview;
char *data;
Py_ssize_t shape[8];
Py_ssize_t strides[8];
Py_ssize_t suboffsets[8];
} memviewslice;
struct Self
{
/* PyObject_HEAD */
struct VTable* tab;
int m;
int n;
memviewslice col_starts;
memviewslice row_indices;
memviewslice x;
};
static float test_SMALL = 1e-10;
extern void INC_MEMVIEW(memviewslice* slice, int count);
extern void XDEC_MEMVIEW(memviewslice* slice, int count);
void run(struct Self* self, memviewslice matrix, memviewslice ub_values, memviewslice priority)
{
Py_ssize_t i;
Py_ssize_t j;
Py_ssize_t m;
Py_ssize_t n;
int nza;
int collen;
double too_large;
double ok;
double obj;
float ub;
float element;
memviewslice col_starts = { 0, 0, { 0 }, { 0 }, { 0 } };
memviewslice row_indices = { 0, 0, { 0 }, { 0 }, { 0 } };
memviewslice x = { 0, 0, { 0 }, { 0 }, { 0 } };
memviewslice t_1 = { 0, 0, { 0 }, { 0 }, { 0 } };
memviewslice t_2 = { 0, 0, { 0 }, { 0 }, { 0 } };
Py_ssize_t t_3;
Py_ssize_t t_4;
Py_ssize_t t_6;
Py_ssize_t t_7;
int t_9;
Py_ssize_t t_10;
Py_ssize_t t_11;
t_1 = self->col_starts;
INC_MEMVIEW(&t_1, 1);
col_starts = t_1;
t_1.memview = NULL;
t_1.data = NULL;
t_1 = self->row_indices;
INC_MEMVIEW(&t_1, 1);
row_indices = t_1;
t_1.memview = NULL;
t_1.data = NULL;
t_2 = self->x;
INC_MEMVIEW(&t_2, 1);
x = t_2;
t_2.memview = NULL;
t_2.data = NULL;
m = matrix.shape[0];
n = matrix.shape[1];
self->m = m;
self->n = n;
nza = 0;
collen = 0;
t_3 = n;
t_4 = t_3;
for (i = 0; i < t_4; i++)
{
t_6 = m + 1;
t_7 = t_6;
for (j = 0; j < t_7; j++)
{
t_9 = (j == 0) != 0;
if (t_9)
{
t_10 = i;
element = (*((float*) ( /* dim=0 */ (priority.data + t_10 * priority.strides[0]) )));
goto L7;
}
{
t_10 = j - 1;
t_11 = i;
element = (*((float*) ( /* dim=1 */ (( /* dim=0 */ ((char *) (((float*) matrix.data) + t_10)) ) + t_11 * matrix.strides[1]) )));
}
L7:;
t_9 = (fabsf(element) < test_SMALL) != 0;
if (t_9)
{
goto L5_continue;
}
t_9 = ((j == 0) != 0);
if (t_9)
{
obj = (double)element;
goto L9;
}
{
collen = nza + 1;
t_11 = collen;
*((int *) ( /* dim=0 */ (col_starts.data + t_11 * col_starts.strides[0]) )) = i + 1;
t_11 = collen;
*((int *) ( /* dim=0 */ (row_indices.data + t_11 * row_indices.strides[0]) )) = j;
t_11 = collen;
*((double *) ( /* dim=0 */ (x.data + t_11 * x.strides[0]) )) = ((double)element);
nza = nza + 1;
}
L9:;
L5_continue:;
}
t_11 = i;
ub = (*((float*) ( /* dim=0 */ (ub_values.data + t_11 * ub_values.strides[0]) )));
t_9 = (ub > FLT_MAX) != 0;
if (t_9)
{
too_large = 0.0;
goto L10;
}
t_9 = (ub > test_SMALL) != 0;
if (t_9)
{
ok = (double)ub;
}
L10:;
}
XDEC_MEMVIEW(&col_starts, 1);
XDEC_MEMVIEW(&row_indices, 1);
XDEC_MEMVIEW(&x, 1);
}
Here is the resulting main loop compiled in assembly using GCC 11.4 (with the compiler option -Ofast -funroll-loops -ftree-vectorize -march=skylake -fno-asynchronous-unwind-tables
):
.L4:
lea rdi, [r12-1]
xor eax, eax
and edi, 3
je .L5
vmovss xmm4, DWORD PTR [rcx]
mov eax, 1
vandps xmm5, xmm4, xmm1
vcomiss xmm9, xmm5
ja .L25
inc edx
movsx r14, edx
mov r15, r9
imul r15, r14
vcvtss2sd xmm6, xmm4, xmm4
mov DWORD PTR [r8+r15], esi
mov r15, r11
imul r15, r14
imul r14, rbp
mov DWORD PTR [r10+r15], 1
vmovsd QWORD PTR [rbx+r14], xmm6
.L25:
cmp rdi, 1
je .L5
cmp rdi, 2
je .L26
inc rax
vmovss xmm7, DWORD PTR [rcx-4+rax*4]
vandps xmm2, xmm7, xmm1
vcomiss xmm9, xmm2
ja .L26
inc edx
movsx rdi, edx
mov r14, r9
mov r15, r11
imul r14, rdi
imul r15, rdi
imul rdi, rbp
mov DWORD PTR [r8+r14], esi
vcvtss2sd xmm3, xmm7, xmm7
mov DWORD PTR [r10+r15], eax
vmovsd QWORD PTR [rbx+rdi], xmm3
.L26:
inc rax
vmovss xmm6, DWORD PTR [rcx-4+rax*4]
vandps xmm8, xmm6, xmm1
vcomiss xmm9, xmm8
jbe .L36
jmp .L5
.L7:
vmovss xmm10, DWORD PTR [rcx-4+rax*4]
vandps xmm11, xmm10, xmm1
vcomiss xmm9, xmm11
ja .L24
inc edx
movsx rdi, edx
mov r14, r9
mov r15, r11
imul r14, rdi
imul r15, rdi
imul rdi, rbp
mov DWORD PTR [r8+r14], esi
vcvtss2sd xmm12, xmm10, xmm10
mov DWORD PTR [r10+r15], eax
vmovsd QWORD PTR [rbx+rdi], xmm12
.L24:
lea r14, [rax+1]
vmovss xmm13, DWORD PTR [rcx-4+r14*4]
vandps xmm14, xmm13, xmm1
vcomiss xmm9, xmm14
ja .L27
inc edx
movsx rdi, edx
mov r15, r9
imul r15, rdi
vcvtss2sd xmm15, xmm13, xmm13
mov DWORD PTR [r8+r15], esi
mov r15, r11
imul r15, rdi
imul rdi, rbp
mov DWORD PTR [r10+r15], r14d
vmovsd QWORD PTR [rbx+rdi], xmm15
.L27:
lea r14, [rax+2]
vmovss xmm0, DWORD PTR [rcx-4+r14*4]
vandps xmm4, xmm0, xmm1
vcomiss xmm9, xmm4
ja .L28
inc edx
movsx rdi, edx
mov r15, r9
imul r15, rdi
vcvtss2sd xmm5, xmm0, xmm0
mov DWORD PTR [r8+r15], esi
mov r15, r11
imul r15, rdi
imul rdi, rbp
mov DWORD PTR [r10+r15], r14d
vmovsd QWORD PTR [rbx+rdi], xmm5
.L28:
add rax, 3
vmovss xmm6, DWORD PTR [rcx-4+rax*4]
vandps xmm7, xmm6, xmm1
vcomiss xmm9, xmm7
ja .L5
.L36:
inc edx
movsx rdi, edx
mov r14, r9
mov r15, r11
imul r14, rdi
imul r15, rdi
imul rdi, rbp
mov DWORD PTR [r8+r14], esi
vcvtss2sd xmm2, xmm6, xmm6
mov DWORD PTR [r10+r15], eax
vmovsd QWORD PTR [rbx+rdi], xmm2
.L5:
inc rax
cmp r12, rax
jne .L7
lea rax, [rsi+1]
add rcx, QWORD PTR [rsp+8]
cmp r13, rsi
je .L2
mov rsi, rax
jmp .L4
.L2:
Same with GCC 12.2:
.L4:
lea rdi, [r12-1]
xor eax, eax
and edi, 3
je .L5
vmovss xmm15, DWORD PTR [rcx]
mov eax, 1
vandps xmm4, xmm15, xmm1
vcomiss xmm0, xmm4
ja .L27
inc edx
movsx r14, edx
mov r15, r9
imul r15, r14
vcvtss2sd xmm5, xmm15, xmm15
mov DWORD PTR [r8+r15], esi
mov r15, r11
imul r15, r14
imul r14, rbp
mov DWORD PTR [r10+r15], 1
vmovsd QWORD PTR [rbx+r14], xmm5
.L27:
cmp rdi, 1
je .L5
cmp rdi, 2
je .L28
inc rax
vmovss xmm6, DWORD PTR [rcx-4+rax*4]
vandps xmm7, xmm6, xmm1
vcomiss xmm0, xmm7
ja .L28
inc edx
movsx rdi, edx
mov r14, r9
mov r15, r11
imul r14, rdi
imul r15, rdi
imul rdi, rbp
mov DWORD PTR [r8+r14], esi
vcvtss2sd xmm2, xmm6, xmm6
mov DWORD PTR [r10+r15], eax
vmovsd QWORD PTR [rbx+rdi], xmm2
.L28:
inc rax
vmovss xmm5, DWORD PTR [rcx-4+rax*4]
vandps xmm3, xmm5, xmm1
vcomiss xmm0, xmm3
jbe .L39
jmp .L5
.L41:
vmovss xmm8, DWORD PTR [rcx-4+rax*4]
vandps xmm9, xmm8, xmm1
vcomiss xmm0, xmm9
ja .L26
inc edx
movsx r15, edx
mov r14, r9
mov rdi, r11
imul r14, r15
imul rdi, r15
imul r15, rbp
mov DWORD PTR [r8+r14], esi
vcvtss2sd xmm10, xmm8, xmm8
mov DWORD PTR [r10+rdi], eax
vmovsd QWORD PTR [rbx+r15], xmm10
.L26:
lea r14, [rax+1]
vmovss xmm11, DWORD PTR [rcx-4+r14*4]
vandps xmm12, xmm11, xmm1
vcomiss xmm0, xmm12
ja .L29
inc edx
movsx rdi, edx
mov r15, r9
imul r15, rdi
vcvtss2sd xmm13, xmm11, xmm11
mov DWORD PTR [r8+r15], esi
mov r15, r11
imul r15, rdi
imul rdi, rbp
mov DWORD PTR [r10+r15], r14d
vmovsd QWORD PTR [rbx+rdi], xmm13
.L29:
lea r14, [rax+2]
vmovss xmm14, DWORD PTR [rcx-4+r14*4]
vandps xmm15, xmm14, xmm1
vcomiss xmm0, xmm15
ja .L30
inc edx
movsx rdi, edx
mov r15, r9
imul r15, rdi
vcvtss2sd xmm4, xmm14, xmm14
mov DWORD PTR [r8+r15], esi
mov r15, r11
imul r15, rdi
imul rdi, rbp
mov DWORD PTR [r10+r15], r14d
vmovsd QWORD PTR [rbx+rdi], xmm4
.L30:
add rax, 3
vmovss xmm5, DWORD PTR [rcx-4+rax*4]
vandps xmm6, xmm5, xmm1
vcomiss xmm0, xmm6
ja .L5
.L39:
inc edx
movsx rdi, edx
mov r14, r9
mov r15, r11
imul r14, rdi
imul r15, rdi
imul rdi, rbp
mov DWORD PTR [r8+r14], esi
vcvtss2sd xmm7, xmm5, xmm5
mov DWORD PTR [r10+r15], eax
vmovsd QWORD PTR [rbx+rdi], xmm7
.L5:
inc rax
cmp r12, rax
jne .L41
mov rdi, QWORD PTR [rsp+8]
lea rax, [rsi+1]
add rcx, rdi
cmp rsi, r13
je .L2
mov rsi, rax
jmp .L4
.L2:
Here is the same with GCC 13.1 (GCC 13.2 is not available on Godbolt):
.L4:
mov QWORD PTR [rsp+16], r14
xor eax, eax
.L20:
mov r8, rax
sub r8, r9
not r8
and r8d, 3
je .L6
inc rax
vmovss xmm15, DWORD PTR [rcx-4+rax*4]
vandps xmm0, xmm15, xmm2
vcomiss xmm1, xmm0
ja .L20
inc edx
mov r15, r10
mov rdi, QWORD PTR [rsp+24]
vcvtss2sd xmm4, xmm15, xmm15
movsx r14, edx
imul r15, r14
mov DWORD PTR [rdi+r15], esi
mov r15, rbx
imul r15, r14
imul r14, r13
mov DWORD PTR [r11+r15], eax
vmovsd QWORD PTR [r12+r14], xmm4
cmp r8, 1
je .L6
cmp r8, 2
je .L21
inc rax
vmovss xmm5, DWORD PTR [rcx-4+rax*4]
vandps xmm6, xmm5, xmm2
vcomiss xmm1, xmm6
ja .L20
inc edx
mov r14, r10
vcvtss2sd xmm7, xmm5, xmm5
movsx r8, edx
imul r14, r8
mov DWORD PTR [rdi+r14], esi
mov rdi, rbx
imul rdi, r8
imul r8, r13
mov DWORD PTR [r11+rdi], eax
vmovsd QWORD PTR [r12+r8], xmm7
.L21:
inc rax
vmovss xmm5, DWORD PTR [rcx-4+rax*4]
vandps xmm3, xmm5, xmm2
vcomiss xmm1, xmm3
ja .L20
inc edx
mov r14, r10
mov rdi, QWORD PTR [rsp+24]
movsx r8, edx
imul r14, r8
.L32:
mov r15, rbx
mov DWORD PTR [rdi+r14], esi
vcvtss2sd xmm8, xmm5, xmm5
imul r15, r8
imul r8, r13
mov DWORD PTR [r11+r15], eax
vmovsd QWORD PTR [r12+r8], xmm8
.L6:
lea r8, [rax+1]
mov rax, r8
cmp r9, r8
je .L7
vmovss xmm9, DWORD PTR [rcx-4+r8*4]
vandps xmm10, xmm9, xmm2
vcomiss xmm1, xmm10
ja .L20
inc edx
mov r15, r10
mov rdi, QWORD PTR [rsp+24]
vcvtss2sd xmm11, xmm9, xmm9
movsx r14, edx
mov DWORD PTR [rsp+4], edx
imul r15, r14
mov DWORD PTR [rdi+r15], esi
mov r15, rbx
imul r15, r14
imul r14, r13
mov DWORD PTR [r11+r15], eax
inc rax
vmovss xmm12, DWORD PTR [rcx-4+rax*4]
vmovsd QWORD PTR [r12+r14], xmm11
vandps xmm13, xmm12, xmm2
vcomiss xmm1, xmm13
ja .L20
inc edx
mov r15, r10
vcvtss2sd xmm14, xmm12, xmm12
movsx r14, edx
imul r15, r14
mov DWORD PTR [rdi+r15], esi
mov r15, rbx
imul r15, r14
imul r14, r13
mov DWORD PTR [r11+r15], eax
lea rax, [r8+2]
vmovss xmm15, DWORD PTR [rcx-4+rax*4]
vmovsd QWORD PTR [r12+r14], xmm14
vandps xmm0, xmm15, xmm2
vcomiss xmm1, xmm0
ja .L20
mov edx, DWORD PTR [rsp+4]
mov r15, r10
vcvtss2sd xmm4, xmm15, xmm15
add edx, 2
movsx r14, edx
imul r15, r14
mov DWORD PTR [rdi+r15], esi
mov r15, rbx
imul r15, r14
imul r14, r13
mov DWORD PTR [r11+r15], eax
lea rax, [r8+3]
vmovss xmm5, DWORD PTR [rcx-4+rax*4]
vmovsd QWORD PTR [r12+r14], xmm4
vandps xmm6, xmm5, xmm2
vcomiss xmm1, xmm6
ja .L20
mov edx, DWORD PTR [rsp+4]
mov r14, r10
add edx, 3
movsx r8, edx
imul r14, r8
jmp .L32
.L7:
mov rdi, QWORD PTR [rsp+8]
mov r14, QWORD PTR [rsp+16]
lea rax, [rsi+1]
add rcx, rdi
cmp r14, rsi
je .L2
mov rsi, rax
jmp .L4
.L2:
First of all, we can see that the main loop is full of conditional branches mixed with goto
. Such kind of code are generally not written by humans. Compiler can expect them to be rare and so optimize them less efficiently than ones without goto
(since assertions are easier to compute). The thing is the j == 0
condition is not really useful here: it can be done only once before the j
-based loop. GCC does not seem to see that (especially due to the complex control-flow, amplified by Cython goto
instructions). On to of that, the compiler do not know if fabsf(element) < SMALL
is often true or often false, or even just hard to predict at runtime. If the loop is easy to predict (the first two cases), then a good compiler can write a more efficient code than the one produced by GCC (possibly using SIMD instructions).
Actually, I am not even sure -funroll-loops
is useful here because of the complex control-flow. It makes the hot-loop significantly bigger (i.e., more space in caches) and the processor migh take more time to predict the probability of the taken/not-taken branches. This is dependent of the target processor. Having smaller code is also good for profiling and optimizing your code: simpler code are easier to optimize.
To optimize the code, the compiler make assumptions on the cost of the loops and the probability of the branches to be taken. If can assume for example that the inner loop if far more expensive than the outer loop. Heuristics are used to optimize the code and their parameters are carefully tuned so to maximize the performance of benchmark suites (and avoid regressions in code like the Linux kernel for GCC). Thus, a side effect is that your code can be slower while most code can be faster using a newer version of GCC, or most compilers. This is also one reason the performance of a given compiler is not very stable from one version to another. For example, if a matrix has a lot of zeros, then fabsf(element) < SMALL
will be often true, causing the outer-loop to be more expensive than expected. The thing is the outer loop might less well optimized in GCC 13.2 than GCC 12.2, resulting in a slower execution time with GCC 13.2. For example, GCC 13.2 starts the outer loop with:
mov r8, rax
sub r8, r9
not r8
and r8d, 3
je .L6
Note that modern processors, like Skylake, can execute multiple instructions in parallel per cycle. This sequence of instructions can only be executed sequentially because of the dependency chain on the r8
register. Meanwhile, GCC 12.2 generate this in the beginning of the outer loop:
lea rdi, [r12-1]
xor eax, eax
and edi, 3
je .L5
The dependency chain (on rdi
/edi
) is apparently significantly smaller in this case. In practice, the assembly code is pretty big, so I do not expect this to be the only issue. One need to track the actual path taken for the input matrices in all versions of GCC (very tedious), or just run the code and profile it with tools like VTune (the recommended solution in this case).
We can see that there is a lot of imul
instructions. Such instructions are typically due to the strides of the array that are not known at compile-time. GCC does not generate code for the strides that are 1. You should really mention this information in the memory views using ::1
if possible (as mentioned by ead in the comments) since imul
instructions tends to be a bit expensive in hot loops like this (not to mention they can also increase the size of the code which is already bigger due to the partial unrolling).