Seemingly nonsensical runtime increases when switching from pure C to C with Numpy objects

Introduction

I am trying to realise some number crunching on a one-dimensional array in C (herafter: standalone) and as a Numpy module written in C (herafter: module) simultaneously. Since all I need to do with the array is to compare selected elements, I could use an abstraction layer for the array access and thus I can use the same code for the standalone or the module.

Now, I expect the module to be somewhat slower, since comparing elements of a Numpy array of unknown type using descr->f->compare requires extra function calls and similar and thus is more costly than the analogous operation for a C array of known type. However, when looking at the output of a profiler (Valgrind), I found runtime increases in the module for lines which have no obvious connection to the Python methods. I want to understand and avoid this, if possible.

Minimal example

Unfortunately, my minimal example is quite lengthy. Note that the Python variant is no real module anymore due to example reduction.

# include <stdlib.h>
# include <stdio.h>

# ifdef PYTHON
    # include <Python.h>
    # include <numpy/arrayobject.h>

    // Define Array creation and access routines for Python.

    typedef PyArrayObject * Array;

    static inline char diff_sign (Array T, int i, int j)
    {
        return T->descr->f->compare ( PyArray_GETPTR1(T,i), PyArray_GETPTR1(T,j), T );
    }

    Array create_array (int n)
    {
        npy_intp dims[1] = {n};
        Array T = (Array) PyArray_SimpleNew (1, dims, NPY_DOUBLE);
        for (int i=0; i<n; i++)
            {* (double *) PyArray_GETPTR1(T,i) = i;} // Line A
        return T;
    }
#endif

# ifdef STANDALONE
    // Define Array creation and access routines for standalone C.

    typedef double * Array;

    static inline char diff_sign (Array T, int i, int j)
    {
        return (T[i]>T[j]) - (T[i]<T[j]);
    }

    Array create_array (int n)
    {
        Array T = malloc (n*sizeof(double));
        for (int i=0; i<n; i++) {T[i] = i;} // Line B
        return T;
    }
# endif

int main()
{
    # ifdef PYTHON
        Py_Initialize();
        import_array();
    # endif

    // avoids that the compiler knows the values of certain variables at runtime.
    int volatile blur = 0;

    int n = 1000;
    Array T = create_array (n);

    # ifdef PYTHON
        for (int i=0; i<n; i++)
            {* (double *) PyArray_GETPTR1(T,i) = i;} // Line C
    # endif
    # ifdef STANDALONE
        for (int i=0; i<n; i++) {T[i] = i;} // Line D
    #endif

    int a = 333 + blur;
    int b = 444 + blur;
    int c = 555 + blur;
    int e = 666 + blur;
    int f = 777 + blur;
    int g =   1 + blur;
    int h =   n + blur;

    // Line E                           standa.  module
    for (int i=h; i>0; i--)             // 4000  8998
    {
        int d = c;
        do c = (c+a)%b;                 // 4000  5000
        while (c>n-1);                  // 2000  2000

        if (c==e) f*=2;                 // 3000  3000
        if ( diff_sign(T,c,d)==g ) f++; // 5000  5000
    }

    printf("%i\n", f);
}

I compiled this with the following two commands:

gcc source.c -o standalone -O3 -g -std=c11 -D STANDALONE
gcc source.c -o module -O3 -g -std=c11 -D PYTHON -lpython2.7 -I/usr/include/python2.7

Changing to -O2 does not change the following; changing the compiler to Clang does change the minimal example but not the phenomenon with my actual code.

Profiling results

The interesting things happen after Line E and I gave the total runtime spent in those lines as reported by the profiler as comments in the source code: Despite having no direct relation to whether I compile as standalone or module, the runtimes for these lines strongly differ. In particular, in my actual application, the additional time spent in those lines in the module makes up for one fourth of the module’s total runtime.

What’s even more weird is that if I remove line C (and D) – which is redundant in the example, as the array’s values are already set (to the same values) in line A (and B) –, the runtime spent in the loop header is reduced from 8998 to 6002 (the other reported runtimes do not change). The same thing happens, if I change int n = 1000; to int n = 1000 + blur;, i.e., if I make n unknown compile time.

This does not make much sense to me and since it has a relevant impact on the runtime, I would like to avoid it.

Questions

Where do these runtime increases come from. I am aware that compilers are not perfect and sometimes work in seemingly mysterious ways, but I would like to understand.
How can I avoid these runtime increases?

Solution

you have to be very careful when interpreting callgrind profiles. Callgrind gives you the instruction fetch count, so the number of instructions. This is not connected to actual performance on modern cpus, as instructions can have different latencies and throughputs and can be reordered by suitably capable cpus.

Also you are here matching the instruction fetch to the lines the debug symbols associate to them. Those do not correspond exactly, e.g. the module code associates the a register copy and a nop instruction (which are essentially free in terms of runtime compared to the following division) to the loop line the source code, while the standalone module associates it to the line above. You can see that in the machine code tab when using --dump-instr=yes in kcachegrind. This is will have something to do with different registers being available for the two variants due to the different number of function calls that imply spilling stuff onto the stack.

Lets look at the modulo loops to see if there is a significant runtime difference:

module:

  400b58:       42 8d 04 3b             lea    (%rbx,%r15,1),%eax
  400b5c:       99                      cltd   
  400b5d:       41 f7 fe                idiv   %r14d
  400b60:       81 fa e7 03 00 00       cmp    $0x3e7,%edx
  400b66:       89 d3                   mov    %edx,%ebx
  400b68:       7f ee                   jg     400b58 <main+0x1b8>

standalone:

  4005f0:       8d 04 32                lea    (%rdx,%rsi,1),%eax
  4005f3:       99                      cltd   
  4005f4:       f7 f9                   idiv   %ecx
  4005f6:       81 fa e7 03 00 00       cmp    $0x3e7,%edx
  4005fc:       7f f2                   jg     4005f0 <main+0x140>

the difference is one register to register copy mov %edx,%ebx (likely again caused by different register pressure due to earlier function calls) this is one of the cheapest operations available in a cpu probably around 1-2 cycles and good throughput, so it should have no measurable effect on the actual wall time. The idiv instruction is the expensive part, it should be around 20 cycles with poor throughput. So the instruction fetch count here is grossly misleading.

A better tool for such detailed profiling is a sampling profiler like perf record/report. When you run long enough you will be able to single out instructions that are costing a lot of time, though the actually high sample counts will also then not match up directly with the slow instructions as the cpu may execute later independent instructions in parallel with the slow ones.