Vectorization of a loop over an array in Cython

Consider the following example of doing an in-place add on Cython memoryviews:

#cython: boundscheck=False, wraparound=False, initializedcheck=False, nonecheck=False, cdivision=True
from libc.stdlib cimport malloc, free
from libc.stdio cimport printf
cimport numpy as np
import numpy as np

cdef extern from "time.h":
    int clock()

cdef void inplace_add(double[::1] a, double[::1] b):
    cdef int i
    for i in range(a.shape[0]):
        a[i] += b[i]

cdef void inplace_addlocal(double[::1] a, double[::1] b):
    cdef int i, n = a.shape[0]
    for i in range(n):
        a[i] += b[i]

def main(int N):
    cdef:
        int rep = 1000000, i
        double* pa = <double*>malloc(N * sizeof(double))
        double* pb = <double*>malloc(N * sizeof(double))
        double[::1] a = <double[:N]>pa
        double[::1] b = <double[:N]>pb
        int start
    start = clock()
    for i in range(N):
        a[i] = b[i] = 1. / (1 + i)
    for i in range(rep):
        inplace_add(a, b)
    printf("loop %i\n", clock() - start)
    start = clock()
    for i in range(N):
        a[i] = b[i] = 1. / (1 + i)
    for i in range(rep):
        inplace_addlocal(a, b)
    printf("loop_local %i\n", clock() - start)

With these Cython directives, the seemingly equivalent inplace_add and inplace_addlocal both compile to tight C loops. But for N=128 (roughly the size I expect), inplace_addlocal is twice(!) as fast as inplace_add when compiled with gcc -Ofast. (A hand-written C function taking (int, double*, double*) is more or less as fast as inplace_addlocal, with or without #pragma omp simd.) Passing -fopt-info to gcc shows that inplace_addlocal gets vectorized, but inplace_add does not.

Is this an issue with the C code that Cython generates (i.e., gcc truly cannot infer the guarantees it needs to vectorize the code), with gcc (i.e., some optimization is missing), or something else?


(cross-posted to cython-users)


  • The only difference in the generated C code is that in inplace_addlocal the loop's end variable is an int, while in inplace_add it is a Py_ssize_t.

    Since your loop counter is an int, the inplace_add version incurs additional overhead from converting between the two types each time the comparison is performed.

    inplace_add (relevant section)

    Py_ssize_t __pyx_t_1;
    int __pyx_t_2;
    int __pyx_t_3;
    int __pyx_t_4;
    __pyx_t_1 = (__pyx_v_a.shape[0]);
    for (__pyx_t_2 = 0; __pyx_t_2 < __pyx_t_1; __pyx_t_2+=1) {
      __pyx_v_i = __pyx_t_2;

    inplace_addlocal (relevant section)

    int __pyx_t_1;
    int __pyx_t_2;
    int __pyx_t_3;
    int __pyx_t_4;
    __pyx_v_n = (__pyx_v_a.shape[0]);
    __pyx_t_1 = __pyx_v_n;
    for (__pyx_t_2 = 0; __pyx_t_2 < __pyx_t_1; __pyx_t_2+=1) {
      __pyx_v_i = __pyx_t_2;

    This answer mentions that it is preferable to use Py_ssize_t for indices (and it is presumably what Cython assumes by default). Declaring i as Py_ssize_t in inplace_add would solve this problem.
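
For illustration, the same three loop shapes can be written outside Cython as a minimal C sketch. It assumes ptrdiff_t as a stand-in for Py_ssize_t, and the function names add_mixed, add_int and add_wide are hypothetical (they do not appear in the generated code). Whether each loop actually gets vectorized depends on the compiler version and flags.

```c
#include <stddef.h>  /* ptrdiff_t, used here as a stand-in for Py_ssize_t */

/* Mirrors inplace_add: an int counter compared against a wider bound. */
void add_mixed(double *a, const double *b, ptrdiff_t n) {
    int i;
    for (i = 0; i < n; i++)
        a[i] += b[i];
}

/* Mirrors inplace_addlocal: the bound is narrowed to int once, before the loop. */
void add_int(double *a, const double *b, ptrdiff_t n) {
    int i, m = (int)n;
    for (i = 0; i < m; i++)
        a[i] += b[i];
}

/* Suggested fix: the counter has the same type as the bound. */
void add_wide(double *a, const double *b, ptrdiff_t n) {
    ptrdiff_t i;
    for (i = 0; i < n; i++)
        a[i] += b[i];
}
```

Compiling this file with gcc -Ofast -fopt-info should report which of the loops were vectorized on a given setup. In Cython, the counterpart of add_wide is declaring the counter with cdef Py_ssize_t i.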