Search code examples
pythongccvisual-c++cythoncompiler-optimization

Cython execution speed vs. MSVC and GCC versions


Intro

I have a fairly simple Cython module - which I simplified even further for the specific tests I have carried on.

This module has only one class, which has only one interesting method (named run): this method accepts as an input a Fortran-ordered 2D NumPy array and two 1D NumPy arrays, and does some very, very simple things on those (see below for code).

For the sake of benchmarking, I have compiled the exact same module with MSVC, GCC 8, GCC 11, GCC 12 and GCC 13 on Windows 10 64 bit, Python 3.9.10 64 bit, Cython 0.29.32. All the GCC compilers I have obtained from the excellent Brecht Sanders GitHub page (https://github.com/brechtsanders).

Main Question

The overarching question of this very long post is: I am just curious to know if anyone has any explanation regarding why GCC12 and GCC13 are so much slower than GCC11 (which is the fastest of all). Looks like performances are going down at each release of GCC, rather than getting better...

Benchmarks

In the benchmarking, I simply vary the array dimensions of the 2D and 1D arrays (m and n) and the number on nonzero entries in the 2D and 1D arrays. I repeat the run method 20 times per compiler version, per set of m and n and nonzero entries.

Optimization settings I am using:

MVSC

MSVC_EXTRA_COMPILE_ARGS = ['/O2', '/GS-', '/fp:fast', '/Ob2', '/nologo', '/arch:AVX512', '/Ot', '/GL']

GCC

GCC_EXTRA_COMPILE_ARGS = ['-Ofast', '-funroll-loops', '-flto', '-ftree-vectorize', '-march=native', '-fno-asynchronous-unwind-tables']
GCC_EXTRA_LINK_ARGS = ['-flto'] + GCC_EXTRA_COMPILE_ARGS

What I am observing is the following:

  • MSVC is by far the slowest at executing the benchmark (why would that be on Windows?)
  • The progression GCC8 -> GCC11 is promising, as GCC11 is faster than GCC8
  • GCC12 and GCC13 are both significantly slower than GCC11, with GCC13 being the worst (twice as slow as GCC11 and much worse than GCC12)

Table of Results:

Runtimes are in milliseconds (ms)

enter image description here

Graph (NOTE: Logarithmic Y axis!!):

Runtimes are in milliseconds (ms)

enter image description here

Code

Cython file:

###############################################################################

import numpy as np
cimport numpy as np
import cython

from cython.view cimport array as cvarray

from libc.float cimport FLT_MAX

DTYPE_float = np.float32
ctypedef np.float32_t DTYPE_float_t

cdef float SMALL = 1e-10
cdef int MAXSIZE = 1000000

cdef extern from "math.h" nogil:
    cdef float fabsf(float x)

###############################################################################

@cython.final
cdef class CythonLoops:

    cdef int m, n
    cdef int [:] col_starts
    cdef int [:] row_indices
    cdef double [:] x

    def __cinit__(self):

        self.m = 0
        self.n = 0

        self.col_starts  = cvarray(shape=(MAXSIZE,), itemsize=sizeof(int), format='i')
        self.row_indices = cvarray(shape=(MAXSIZE,), itemsize=sizeof(int), format='i')
        self.x = cvarray(shape=(MAXSIZE,), itemsize=sizeof(double), format='d')

    @cython.boundscheck(False) # turn off bounds-checking for entire function
    @cython.wraparound(False)  # turn off negative index wrapping for entire function
    @cython.nonecheck(False)
    @cython.initializedcheck(False)
    @cython.cdivision(True)
    cpdef run(self, DTYPE_float_t[::1, :] matrix,
                    DTYPE_float_t[:]      ub_values,
                    DTYPE_float_t[:]      priority):

        cdef Py_ssize_t i, j, m, n
        cdef int nza, collen
        cdef double too_large, ok, obj
        cdef float ub, element

        cdef int [:] col_starts = self.col_starts
        cdef int [:] row_indices = self.row_indices
        cdef double [:] x = self.x

        m = matrix.shape[0]
        n = matrix.shape[1]

        self.m = m
        self.n = n

        nza = 0
        collen = 0
        for i in range(n):
            for j in range(m+1):
                if j == 0:
                    element = priority[i]
                else:
                    element = matrix[j-1, i]

                if fabsf(element) < SMALL:
                    continue

                if j == 0:
                    obj = <double>element
                    # Do action 1 with external library
                else:
                    collen = nza + 1
                    col_starts[collen] = i+1
                    row_indices[collen] = j
                    x[collen] = <double>element
                    nza += 1

            ub = ub_values[i]

            if ub > FLT_MAX:
                too_large = 0.0
                # Do action 2 with external library
            elif ub > SMALL:
                ok = <double>ub
                # Do action 3 with external library

        # Use x, row_indices and col_starts in the external library

Setup file:

I use the following to compile it:

python setup.py build_ext --inplace --compiler=mingw32 gcc13

Where the last argument is the compiler I want to test

#!/usr/bin/env python

from setuptools import setup
from setuptools import Extension

from Cython.Build import cythonize
from Cython.Distutils import build_ext

import numpy as np
import os
import shutil
import sys
import getpass


MODULE = 'loop_cython_%s'

GCC_EXTRA_COMPILE_ARGS = ['-Ofast', '-funroll-loops', '-flto', '-ftree-vectorize', '-march=native', '-fno-asynchronous-unwind-tables']
GCC_EXTRA_LINK_ARGS = ['-flto'] + GCC_EXTRA_COMPILE_ARGS

MSVC_EXTRA_COMPILE_ARGS = ['/O2', '/GS-', '/fp:fast', '/Ob2', '/nologo', '/arch:AVX512', '/Ot', '/GL']
MSVC_EXTRA_LINK_ARGS = MSVC_EXTRA_COMPILE_ARGS


def remove_builds(kind):

    for folder in ['build', 'bin']:
        if os.path.isdir(folder):
            if folder == 'bin':
                continue
            shutil.rmtree(folder, ignore_errors=True)

    if os.path.isfile(MODULE + '_%s.c'%kind):
        os.remove(MODULE + '_%s.c'%kind)


def setenv(extra_args, doset=True, path=None, kind='gcc8'):

    flags = ''
    if doset:
        flags = '  '.join(extra_args)

    for key in ['CFLAGS', 'FFLAGS', 'CPPFLAGS']:
        os.environ[key] = flags

    user = getpass.getuser()

    if doset:
        path = os.environ['PATH']
        if kind == 'gcc8':
            os.environ['PATH'] = r'C:\Users\%s\Tools\MinGW64_8.0\bin;C:\WINDOWS\system32;C:\WINDOWS;C:\Users\%s\WinPython39\WPy64-39100\python-3.9.10.amd64;'%(user, user)
        elif kind == 'gcc11':
            os.environ['PATH'] = r'C:\Users\%s\Tools\MinGW64\bin;C:\WINDOWS\system32;C:\WINDOWS;C:\Users\%s\WinPython39\WPy64-39100\python-3.9.10.amd64;'%(user, user)
        elif kind == 'gcc12':
            os.environ['PATH'] = r'C:\Users\%s\Tools\MinGW64_12.2.0\bin;C:\WINDOWS\system32;C:\WINDOWS;C:\Users\%s\WinPython39\WPy64-39100\python-3.9.10.amd64;'%(user, user)
        elif kind == 'gcc13':
            os.environ['PATH'] = r'C:\Users\%s\Tools\MinGW64_13.2.0\bin;C:\WINDOWS\system32;C:\WINDOWS;C:\Users\%s\WinPython39\WPy64-39100\python-3.9.10.amd64;'%(user, user)
        elif kind == 'msvc':
            os.environ['PATH'] = r'C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.35.32215\bin\Hostx64\x64;C:\WINDOWS\system32;C:\WINDOWS;C:\Users\J0514162\WinPython39\WPy64-39100\python-3.9.10.amd64;C:\Program Files (x86)\Windows Kits\10\bin\10.0.22000.0\x64'
            os.environ['LIB'] = r'C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.35.32215\lib\x64;C:\Program Files (x86)\Windows Kits\10\Lib\10.0.22000.0\um\x64;C:\Program Files (x86)\Windows Kits\10\Lib\10.0.22000.0\ucrt\x64'
            os.environ["DISTUTILS_USE_SDK"] = '1'
            os.environ["MSSdk"] = '1'

    else:
        os.environ['PATH'] = path

    return path

class CustomBuildExt(build_ext):
    def build_extensions(self):
        # Override the compiler executables. Importantly, this
        # removes the "default" compiler flags that would
        # otherwise get passed on to to the compiler, i.e.,
        # distutils.sysconfig.get_var("CFLAGS").
        self.compiler.set_executable("compiler_so", "gcc -mdll -O -Wall -DMS_WIN64")
        self.compiler.set_executable("compiler_cxx", "g++ -O -Wall -DMS_WIN64")
        self.compiler.set_executable("linker_so", "gcc -shared -static")
        self.compiler.dll_libraries = []
        build_ext.build_extensions(self)

if __name__ == '__main__':

    os.system('cls')

    kind = None
    for arg in sys.argv:
        if arg.strip() in ['gcc8', 'gcc11', 'gcc12', 'gcc13', 'msvc']:
            kind = arg
            sys.argv.remove(arg)
            break

    base_file = os.path.join(os.getcwd(), MODULE[0:-3])


    source = base_file + '.pyx'
    target = base_file + '_%s.pyx'%kind
    shutil.copyfile(source, target)

    if kind == 'msvc':
        extra_compile_args = MSVC_EXTRA_COMPILE_ARGS[:]
        extra_link_args = MSVC_EXTRA_LINK_ARGS[:] + ['/MANIFEST']
    else:
        extra_compile_args = GCC_EXTRA_COMPILE_ARGS[:]
        extra_link_args = GCC_EXTRA_LINK_ARGS[:]

    path = setenv(extra_compile_args, kind=kind)
    remove_builds(kind)

    define_macros = [('WIN32', 1)]

    nname = MODULE%kind

    include_dirs = [np.get_include()]
    if kind == 'msvc':
        include_dirs += [r'C:\Program Files (x86)\Windows Kits\10\Include\10.0.22000.0\ucrt',
                         r'C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.35.32215\include',
                         r'C:\Program Files (x86)\Windows Kits\10\Include\10.0.22000.0\shared']

    extensions = [
        Extension(nname, [nname + '.pyx'],
                  extra_compile_args=extra_compile_args,
                  extra_link_args=extra_link_args,
                  include_dirs=include_dirs,
                  define_macros=define_macros)]

    # build the core extension(s)
    setup_kwargs = {'ext_modules': cythonize(extensions,
                                             compiler_directives={'embedsignature'  : False,
                                                                  'boundscheck'     : False,
                                                                  'wraparound'      : False,
                                                                  'initializedcheck': False,
                                                                  'cdivision'       : True,
                                                                  'language_level'  : '3str',
                                                                  'nonecheck'       : False},
                                             force=True,
                                             cache=False,
                                             quiet=False)}

    if kind != 'msvc':
        setup_kwargs['cmdclass'] = {'build_ext': CustomBuildExt}

    setup(**setup_kwargs)

    setenv([], False, path)
    remove_builds(kind)

Test code:

import os
import numpy
import time

import loop_cython_msvc as msvc
import loop_cython_gcc8 as gcc8
import loop_cython_gcc11 as gcc11
import loop_cython_gcc12 as gcc12
import loop_cython_gcc13 as gcc13

              # M     N     NNZ(matrix) NNZ(priority) NNZ(ub)
DIMENSIONS = [(1661 , 2608 , 3560      , 375         , 2488 ),
              (2828 , 3512 , 4333      , 413         , 2973 ),
              (780  , 985  , 646       , 23          , 984  ),
              (799  , 1558 , 1883      , 301         , 1116 ),
              (399  , 540  , 388       , 44          , 517  ),
              (10545, 10486, 14799     , 1053        , 10041),
              (3369 , 3684 , 3684      , 256         , 3242 ),
              (2052 , 5513 , 4772      , 1269        , 3319 ),
              (224  , 628  , 1345      , 396         , 594  ),
              (553  , 1475 , 1315      , 231         , 705  )]


def RunTest():

    print('M      N      NNZ     MSVC     GCC 8    GCC 11   GCC 12   GCC 13')

    for m, n, nnz_mat, nnz_priority, nnz_ub in DIMENSIONS:
        print('%-6d %-6d %-8d'%(m, n, nnz_mat), end='')

        for solver, label in zip([msvc, gcc8, gcc11, gcc12, gcc13], ['MSVC', 'GCC 8', 'GCC 11', 'GCC 12', 'GCC 13']):

            numpy.random.seed(123456)
            size = m*n
            idxes = numpy.arange(size)
            matrix = numpy.zeros((size, ), dtype=numpy.float32)
            idx_mat = numpy.random.choice(idxes, nnz_mat)
            matrix[idx_mat] = numpy.random.uniform(0, 1000, size=(nnz_mat, ))

            matrix = numpy.asfortranarray(matrix.reshape((m, n)))

            idxes = numpy.arange(m)
            priority = numpy.zeros((m, ), dtype=numpy.float32)
            idx_pri = numpy.random.choice(idxes, nnz_priority)
            priority[idx_pri] = numpy.random.uniform(0, 1000, size=(nnz_priority, ))

            idxes = numpy.arange(n)
            ub_values = numpy.inf*numpy.ones((n, ), dtype=numpy.float32)
            idx_ub = numpy.random.choice(idxes, nnz_ub)
            ub_values[idx_ub] = numpy.random.uniform(0, 1000, size=(nnz_ub, ))

            solver = solver.CythonLoops()
            time_tot = []

            for i in range(20):

                start = time.perf_counter()
                solver.run(matrix, ub_values, priority)
                elapsed = time.perf_counter() - start
                time_tot.append(elapsed*1e3)

            print('%-8.4g'%numpy.mean(time_tot), end=' ')

        print()

if __name__ == '__main__':

    os.system('cls')
    RunTest()

EDIT

After @PeterCordes comments, I have changed the optimization flags to this:

MSVC_EXTRA_COMPILE_ARGS = ['/O2', '/GS-', '/fp:fast', '/Ob2', '/nologo', '/arch:AVX512', '/Ot', '/GL', '/QIntel-jcc-erratum']
GCC_EXTRA_COMPILE_ARGS = ['-Ofast', '-funroll-loops', '-flto', '-ftree-vectorize', '-march=native', '-fno-asynchronous-unwind-tables', '-Wa,-mbranches-within-32B-boundaries']

MSVC appears to be marginally faster than before (between 5% and 10%), but GCC12 and GCC13 are slower than before (between 3% and 20%). Below a graph with the results on the largest 2D matrix:

Note: "Current" means with the latest optimization flags suggested by @PeterCordes, "Previous" is the original set of flags.

enter image description here


Solution

  • This is a partial answer providing the generated C code produced by Cython once it has been simplified a bit to be shorter, more human-readable and easy to compile without any Cython, Python, or NumPy dependencies (the transformations are not expected to drastically impact the timings). It also show the generated assembly code and few comments.

    Here is the C code:

    #include <stdlib.h>
    #include <stdint.h>
    #include <math.h>
    #include <float.h>
    
    /* Remove external dependencies */
    typedef int64_t Py_ssize_t;
    typedef void memoryview_obj;
    typedef void* VTable;
    
    typedef struct {
      struct memoryview_obj *memview;
      char *data;
      Py_ssize_t shape[8];
      Py_ssize_t strides[8];
      Py_ssize_t suboffsets[8];
    } memviewslice;
    
    struct Self
    {
        /* PyObject_HEAD */
        struct VTable* tab;
        int m;
        int n;
        memviewslice col_starts;
        memviewslice row_indices;
        memviewslice x;
    };
    
    static float test_SMALL = 1e-10;
    
    extern void INC_MEMVIEW(memviewslice* slice, int count);
    extern void XDEC_MEMVIEW(memviewslice* slice, int count);
    
    void run(struct Self* self, memviewslice matrix, memviewslice ub_values, memviewslice priority)
    {
        Py_ssize_t i;
        Py_ssize_t j;
        Py_ssize_t m;
        Py_ssize_t n;
        int nza;
        int collen;
        double too_large;
        double ok;
        double obj;
        float ub;
        float element;
        memviewslice col_starts = { 0, 0, { 0 }, { 0 }, { 0 } };
        memviewslice row_indices = { 0, 0, { 0 }, { 0 }, { 0 } };
        memviewslice x = { 0, 0, { 0 }, { 0 }, { 0 } };
        memviewslice t_1 = { 0, 0, { 0 }, { 0 }, { 0 } };
        memviewslice t_2 = { 0, 0, { 0 }, { 0 }, { 0 } };
        Py_ssize_t t_3;
        Py_ssize_t t_4;
        Py_ssize_t t_6;
        Py_ssize_t t_7;
        int t_9;
        Py_ssize_t t_10;
        Py_ssize_t t_11;
    
        t_1 = self->col_starts;
        INC_MEMVIEW(&t_1, 1);
        col_starts = t_1;
        t_1.memview = NULL;
        t_1.data = NULL;
    
        t_1 = self->row_indices;
        INC_MEMVIEW(&t_1, 1);
        row_indices = t_1;
        t_1.memview = NULL;
        t_1.data = NULL;
    
        t_2 = self->x;
        INC_MEMVIEW(&t_2, 1);
        x = t_2;
        t_2.memview = NULL;
        t_2.data = NULL;
    
        m = matrix.shape[0];
        n = matrix.shape[1];
        self->m = m;
        self->n = n;
        nza = 0;
        collen = 0;
    
        t_3 = n;
        t_4 = t_3;
        for (i = 0; i < t_4; i++)
        {
            t_6 = m + 1;
            t_7 = t_6;
    
            for (j = 0; j < t_7; j++)
            {
                t_9 = (j == 0) != 0;
    
                if (t_9)
                {
                    t_10 = i;
                    element = (*((float*) ( /* dim=0 */ (priority.data + t_10 * priority.strides[0]) )));
    
                    goto L7;
                }
    
                {
                    t_10 = j - 1;
                    t_11 = i;
                    element = (*((float*) ( /* dim=1 */ (( /* dim=0 */ ((char *) (((float*) matrix.data) + t_10)) ) + t_11 * matrix.strides[1]) )));
                }
    
                L7:;
    
                t_9 = (fabsf(element) < test_SMALL) != 0;
                if (t_9)
                {
                    goto L5_continue;
                }
    
                t_9 = ((j == 0) != 0);
                if (t_9)
                {
    
                    obj = (double)element;
    
                    goto L9;
                }
    
                {
                    collen = nza + 1;
    
                    t_11 = collen;
                    *((int *) ( /* dim=0 */ (col_starts.data + t_11 * col_starts.strides[0]) )) = i + 1;
    
                    t_11 = collen;
                    *((int *) ( /* dim=0 */ (row_indices.data + t_11 * row_indices.strides[0]) )) = j;
    
                    t_11 = collen;
                    *((double *) ( /* dim=0 */ (x.data + t_11 * x.strides[0]) )) = ((double)element);
    
                    nza = nza + 1;
                }
    
                L9:;
                L5_continue:;
            }
    
            t_11 = i;
            ub = (*((float*) ( /* dim=0 */ (ub_values.data + t_11 * ub_values.strides[0]) )));
    
            t_9 = (ub > FLT_MAX) != 0;
            if (t_9)
            {
                too_large = 0.0;
                goto L10;
            }
    
            t_9 = (ub > test_SMALL) != 0;
            if (t_9)
            {
                ok = (double)ub;
            }
    
            L10:;
        }
    
        XDEC_MEMVIEW(&col_starts, 1);
        XDEC_MEMVIEW(&row_indices, 1);
        XDEC_MEMVIEW(&x, 1);
    }
    

    Here is the resulting main loop compiled in assembly using GCC 11.4 (with the compiler option -Ofast -funroll-loops -ftree-vectorize -march=skylake -fno-asynchronous-unwind-tables):

    .L4:
            lea     rdi, [r12-1]
            xor     eax, eax
            and     edi, 3
            je      .L5
            vmovss  xmm4, DWORD PTR [rcx]
            mov     eax, 1
            vandps  xmm5, xmm4, xmm1
            vcomiss xmm9, xmm5
            ja      .L25
            inc     edx
            movsx   r14, edx
            mov     r15, r9
            imul    r15, r14
            vcvtss2sd       xmm6, xmm4, xmm4
            mov     DWORD PTR [r8+r15], esi
            mov     r15, r11
            imul    r15, r14
            imul    r14, rbp
            mov     DWORD PTR [r10+r15], 1
            vmovsd  QWORD PTR [rbx+r14], xmm6
    .L25:
            cmp     rdi, 1
            je      .L5
            cmp     rdi, 2
            je      .L26
            inc     rax
            vmovss  xmm7, DWORD PTR [rcx-4+rax*4]
            vandps  xmm2, xmm7, xmm1
            vcomiss xmm9, xmm2
            ja      .L26
            inc     edx
            movsx   rdi, edx
            mov     r14, r9
            mov     r15, r11
            imul    r14, rdi
            imul    r15, rdi
            imul    rdi, rbp
            mov     DWORD PTR [r8+r14], esi
            vcvtss2sd       xmm3, xmm7, xmm7
            mov     DWORD PTR [r10+r15], eax
            vmovsd  QWORD PTR [rbx+rdi], xmm3
    .L26:
            inc     rax
            vmovss  xmm6, DWORD PTR [rcx-4+rax*4]
            vandps  xmm8, xmm6, xmm1
            vcomiss xmm9, xmm8
            jbe     .L36
            jmp     .L5
    .L7:
            vmovss  xmm10, DWORD PTR [rcx-4+rax*4]
            vandps  xmm11, xmm10, xmm1
            vcomiss xmm9, xmm11
            ja      .L24
            inc     edx
            movsx   rdi, edx
            mov     r14, r9
            mov     r15, r11
            imul    r14, rdi
            imul    r15, rdi
            imul    rdi, rbp
            mov     DWORD PTR [r8+r14], esi
            vcvtss2sd       xmm12, xmm10, xmm10
            mov     DWORD PTR [r10+r15], eax
            vmovsd  QWORD PTR [rbx+rdi], xmm12
    .L24:
            lea     r14, [rax+1]
            vmovss  xmm13, DWORD PTR [rcx-4+r14*4]
            vandps  xmm14, xmm13, xmm1
            vcomiss xmm9, xmm14
            ja      .L27
            inc     edx
            movsx   rdi, edx
            mov     r15, r9
            imul    r15, rdi
            vcvtss2sd       xmm15, xmm13, xmm13
            mov     DWORD PTR [r8+r15], esi
            mov     r15, r11
            imul    r15, rdi
            imul    rdi, rbp
            mov     DWORD PTR [r10+r15], r14d
            vmovsd  QWORD PTR [rbx+rdi], xmm15
    .L27:
            lea     r14, [rax+2]
            vmovss  xmm0, DWORD PTR [rcx-4+r14*4]
            vandps  xmm4, xmm0, xmm1
            vcomiss xmm9, xmm4
            ja      .L28
            inc     edx
            movsx   rdi, edx
            mov     r15, r9
            imul    r15, rdi
            vcvtss2sd       xmm5, xmm0, xmm0
            mov     DWORD PTR [r8+r15], esi
            mov     r15, r11
            imul    r15, rdi
            imul    rdi, rbp
            mov     DWORD PTR [r10+r15], r14d
            vmovsd  QWORD PTR [rbx+rdi], xmm5
    .L28:
            add     rax, 3
            vmovss  xmm6, DWORD PTR [rcx-4+rax*4]
            vandps  xmm7, xmm6, xmm1
            vcomiss xmm9, xmm7
            ja      .L5
    .L36:
            inc     edx
            movsx   rdi, edx
            mov     r14, r9
            mov     r15, r11
            imul    r14, rdi
            imul    r15, rdi
            imul    rdi, rbp
            mov     DWORD PTR [r8+r14], esi
            vcvtss2sd       xmm2, xmm6, xmm6
            mov     DWORD PTR [r10+r15], eax
            vmovsd  QWORD PTR [rbx+rdi], xmm2
    .L5:
            inc     rax
            cmp     r12, rax
            jne     .L7
            lea     rax, [rsi+1]
            add     rcx, QWORD PTR [rsp+8]
            cmp     r13, rsi
            je      .L2
            mov     rsi, rax
            jmp     .L4
    .L2:
    

    Same with GCC 12.2:

    .L4:
            lea     rdi, [r12-1]
            xor     eax, eax
            and     edi, 3
            je      .L5
            vmovss  xmm15, DWORD PTR [rcx]
            mov     eax, 1
            vandps  xmm4, xmm15, xmm1
            vcomiss xmm0, xmm4
            ja      .L27
            inc     edx
            movsx   r14, edx
            mov     r15, r9
            imul    r15, r14
            vcvtss2sd       xmm5, xmm15, xmm15
            mov     DWORD PTR [r8+r15], esi
            mov     r15, r11
            imul    r15, r14
            imul    r14, rbp
            mov     DWORD PTR [r10+r15], 1
            vmovsd  QWORD PTR [rbx+r14], xmm5
    .L27:
            cmp     rdi, 1
            je      .L5
            cmp     rdi, 2
            je      .L28
            inc     rax
            vmovss  xmm6, DWORD PTR [rcx-4+rax*4]
            vandps  xmm7, xmm6, xmm1
            vcomiss xmm0, xmm7
            ja      .L28
            inc     edx
            movsx   rdi, edx
            mov     r14, r9
            mov     r15, r11
            imul    r14, rdi
            imul    r15, rdi
            imul    rdi, rbp
            mov     DWORD PTR [r8+r14], esi
            vcvtss2sd       xmm2, xmm6, xmm6
            mov     DWORD PTR [r10+r15], eax
            vmovsd  QWORD PTR [rbx+rdi], xmm2
    .L28:
            inc     rax
            vmovss  xmm5, DWORD PTR [rcx-4+rax*4]
            vandps  xmm3, xmm5, xmm1
            vcomiss xmm0, xmm3
            jbe     .L39
            jmp     .L5
    .L41:
            vmovss  xmm8, DWORD PTR [rcx-4+rax*4]
            vandps  xmm9, xmm8, xmm1
            vcomiss xmm0, xmm9
            ja      .L26
            inc     edx
            movsx   r15, edx
            mov     r14, r9
            mov     rdi, r11
            imul    r14, r15
            imul    rdi, r15
            imul    r15, rbp
            mov     DWORD PTR [r8+r14], esi
            vcvtss2sd       xmm10, xmm8, xmm8
            mov     DWORD PTR [r10+rdi], eax
            vmovsd  QWORD PTR [rbx+r15], xmm10
    .L26:
            lea     r14, [rax+1]
            vmovss  xmm11, DWORD PTR [rcx-4+r14*4]
            vandps  xmm12, xmm11, xmm1
            vcomiss xmm0, xmm12
            ja      .L29
            inc     edx
            movsx   rdi, edx
            mov     r15, r9
            imul    r15, rdi
            vcvtss2sd       xmm13, xmm11, xmm11
            mov     DWORD PTR [r8+r15], esi
            mov     r15, r11
            imul    r15, rdi
            imul    rdi, rbp
            mov     DWORD PTR [r10+r15], r14d
            vmovsd  QWORD PTR [rbx+rdi], xmm13
    .L29:
            lea     r14, [rax+2]
            vmovss  xmm14, DWORD PTR [rcx-4+r14*4]
            vandps  xmm15, xmm14, xmm1
            vcomiss xmm0, xmm15
            ja      .L30
            inc     edx
            movsx   rdi, edx
            mov     r15, r9
            imul    r15, rdi
            vcvtss2sd       xmm4, xmm14, xmm14
            mov     DWORD PTR [r8+r15], esi
            mov     r15, r11
            imul    r15, rdi
            imul    rdi, rbp
            mov     DWORD PTR [r10+r15], r14d
            vmovsd  QWORD PTR [rbx+rdi], xmm4
    .L30:
            add     rax, 3
            vmovss  xmm5, DWORD PTR [rcx-4+rax*4]
            vandps  xmm6, xmm5, xmm1
            vcomiss xmm0, xmm6
            ja      .L5
    .L39:
            inc     edx
            movsx   rdi, edx
            mov     r14, r9
            mov     r15, r11
            imul    r14, rdi
            imul    r15, rdi
            imul    rdi, rbp
            mov     DWORD PTR [r8+r14], esi
            vcvtss2sd       xmm7, xmm5, xmm5
            mov     DWORD PTR [r10+r15], eax
            vmovsd  QWORD PTR [rbx+rdi], xmm7
    .L5:
            inc     rax
            cmp     r12, rax
            jne     .L41
            mov     rdi, QWORD PTR [rsp+8]
            lea     rax, [rsi+1]
            add     rcx, rdi
            cmp     rsi, r13
            je      .L2
            mov     rsi, rax
            jmp     .L4
    .L2:
    

    Here is the same with GCC 13.1 (GCC 13.2 is not available on Godbolt):

    .L4:
            mov     QWORD PTR [rsp+16], r14
            xor     eax, eax
    .L20:
            mov     r8, rax
            sub     r8, r9
            not     r8
            and     r8d, 3
            je      .L6
            inc     rax
            vmovss  xmm15, DWORD PTR [rcx-4+rax*4]
            vandps  xmm0, xmm15, xmm2
            vcomiss xmm1, xmm0
            ja      .L20
            inc     edx
            mov     r15, r10
            mov     rdi, QWORD PTR [rsp+24]
            vcvtss2sd       xmm4, xmm15, xmm15
            movsx   r14, edx
            imul    r15, r14
            mov     DWORD PTR [rdi+r15], esi
            mov     r15, rbx
            imul    r15, r14
            imul    r14, r13
            mov     DWORD PTR [r11+r15], eax
            vmovsd  QWORD PTR [r12+r14], xmm4
            cmp     r8, 1
            je      .L6
            cmp     r8, 2
            je      .L21
            inc     rax
            vmovss  xmm5, DWORD PTR [rcx-4+rax*4]
            vandps  xmm6, xmm5, xmm2
            vcomiss xmm1, xmm6
            ja      .L20
            inc     edx
            mov     r14, r10
            vcvtss2sd       xmm7, xmm5, xmm5
            movsx   r8, edx
            imul    r14, r8
            mov     DWORD PTR [rdi+r14], esi
            mov     rdi, rbx
            imul    rdi, r8
            imul    r8, r13
            mov     DWORD PTR [r11+rdi], eax
            vmovsd  QWORD PTR [r12+r8], xmm7
    .L21:
            inc     rax
            vmovss  xmm5, DWORD PTR [rcx-4+rax*4]
            vandps  xmm3, xmm5, xmm2
            vcomiss xmm1, xmm3
            ja      .L20
            inc     edx
            mov     r14, r10
            mov     rdi, QWORD PTR [rsp+24]
            movsx   r8, edx
            imul    r14, r8
    .L32:
            mov     r15, rbx
            mov     DWORD PTR [rdi+r14], esi
            vcvtss2sd       xmm8, xmm5, xmm5
            imul    r15, r8
            imul    r8, r13
            mov     DWORD PTR [r11+r15], eax
            vmovsd  QWORD PTR [r12+r8], xmm8
    .L6:
            lea     r8, [rax+1]
            mov     rax, r8
            cmp     r9, r8
            je      .L7
            vmovss  xmm9, DWORD PTR [rcx-4+r8*4]
            vandps  xmm10, xmm9, xmm2
            vcomiss xmm1, xmm10
            ja      .L20
            inc     edx
            mov     r15, r10
            mov     rdi, QWORD PTR [rsp+24]
            vcvtss2sd       xmm11, xmm9, xmm9
            movsx   r14, edx
            mov     DWORD PTR [rsp+4], edx
            imul    r15, r14
            mov     DWORD PTR [rdi+r15], esi
            mov     r15, rbx
            imul    r15, r14
            imul    r14, r13
            mov     DWORD PTR [r11+r15], eax
            inc     rax
            vmovss  xmm12, DWORD PTR [rcx-4+rax*4]
            vmovsd  QWORD PTR [r12+r14], xmm11
            vandps  xmm13, xmm12, xmm2
            vcomiss xmm1, xmm13
            ja      .L20
            inc     edx
            mov     r15, r10
            vcvtss2sd       xmm14, xmm12, xmm12
            movsx   r14, edx
            imul    r15, r14
            mov     DWORD PTR [rdi+r15], esi
            mov     r15, rbx
            imul    r15, r14
            imul    r14, r13
            mov     DWORD PTR [r11+r15], eax
            lea     rax, [r8+2]
            vmovss  xmm15, DWORD PTR [rcx-4+rax*4]
            vmovsd  QWORD PTR [r12+r14], xmm14
            vandps  xmm0, xmm15, xmm2
            vcomiss xmm1, xmm0
            ja      .L20
            mov     edx, DWORD PTR [rsp+4]
            mov     r15, r10
            vcvtss2sd       xmm4, xmm15, xmm15
            add     edx, 2
            movsx   r14, edx
            imul    r15, r14
            mov     DWORD PTR [rdi+r15], esi
            mov     r15, rbx
            imul    r15, r14
            imul    r14, r13
            mov     DWORD PTR [r11+r15], eax
            lea     rax, [r8+3]
            vmovss  xmm5, DWORD PTR [rcx-4+rax*4]
            vmovsd  QWORD PTR [r12+r14], xmm4
            vandps  xmm6, xmm5, xmm2
            vcomiss xmm1, xmm6
            ja      .L20
            mov     edx, DWORD PTR [rsp+4]
            mov     r14, r10
            add     edx, 3
            movsx   r8, edx
            imul    r14, r8
            jmp     .L32
    .L7:
            mov     rdi, QWORD PTR [rsp+8]
            mov     r14, QWORD PTR [rsp+16]
            lea     rax, [rsi+1]
            add     rcx, rdi
            cmp     r14, rsi
            je      .L2
            mov     rsi, rax
            jmp     .L4
    .L2:
    

    Discussion

    First of all, we can see that the main loop is full of conditional branches mixed with goto. Such kind of code are generally not written by humans. Compiler can expect them to be rare and so optimize them less efficiently than ones without goto (since assertions are easier to compute). The thing is the j == 0 condition is not really useful here: it can be done only once before the j-based loop. GCC does not seem to see that (especially due to the complex control-flow, amplified by Cython goto instructions). On to of that, the compiler do not know if fabsf(element) < SMALL is often true or often false, or even just hard to predict at runtime. If the loop is easy to predict (the first two cases), then a good compiler can write a more efficient code than the one produced by GCC (possibly using SIMD instructions).

    Actually, I am not even sure -funroll-loops is useful here because of the complex control-flow. It makes the hot-loop significantly bigger (i.e., more space in caches) and the processor migh take more time to predict the probability of the taken/not-taken branches. This is dependent of the target processor. Having smaller code is also good for profiling and optimizing your code: simpler code are easier to optimize.

    To optimize the code, the compiler make assumptions on the cost of the loops and the probability of the branches to be taken. If can assume for example that the inner loop if far more expensive than the outer loop. Heuristics are used to optimize the code and their parameters are carefully tuned so to maximize the performance of benchmark suites (and avoid regressions in code like the Linux kernel for GCC). Thus, a side effect is that your code can be slower while most code can be faster using a newer version of GCC, or most compilers. This is also one reason the performance of a given compiler is not very stable from one version to another. For example, if a matrix has a lot of zeros, then fabsf(element) < SMALL will be often true, causing the outer-loop to be more expensive than expected. The thing is the outer loop might less well optimized in GCC 13.2 than GCC 12.2, resulting in a slower execution time with GCC 13.2. For example, GCC 13.2 starts the outer loop with:

            mov     r8, rax
            sub     r8, r9
            not     r8
            and     r8d, 3
            je      .L6
    

    Note that modern processors, like Skylake, can execute multiple instructions in parallel per cycle. This sequence of instructions can only be executed sequentially because of the dependency chain on the r8 register. Meanwhile, GCC 12.2 generate this in the beginning of the outer loop:

            lea     rdi, [r12-1]
            xor     eax, eax
            and     edi, 3
            je      .L5
    

    The dependency chain (on rdi/edi) is apparently significantly smaller in this case. In practice, the assembly code is pretty big, so I do not expect this to be the only issue. One need to track the actual path taken for the input matrices in all versions of GCC (very tedious), or just run the code and profile it with tools like VTune (the recommended solution in this case).

    We can see that there is a lot of imul instructions. Such instructions are typically due to the strides of the array that are not known at compile-time. GCC does not generate code for the strides that are 1. You should really mention this information in the memory views using ::1 if possible (as mentioned by ead in the comments) since imul instructions tends to be a bit expensive in hot loops like this (not to mention they can also increase the size of the code which is already bigger due to the partial unrolling).