Execution Speed: Cython vs ctypes

I am learning about different ways to interface Python and C. I have a function that sums the integers from 0 up to (but not including) its input. I've written it in pure Python, in Cython, and in C (called through ctypes). I then timed each version by calling it 1000 times with an input of 5000, and these were the results:

Python: ~0.2 s
Ctypes: ~0.01 s (~20x faster)
Cython: ~0.0013 s (~154x faster)

Cython is way faster than the ctypes approach (where I coded the C myself). What makes cython so much faster than the ctypes approach? I provide all the details below:

Python setup

example_py.py:

def sumTo(x):
    y = 0
    for i in range(x):
        y += i
    return y

Cython setup

example_cy.pyx:

cpdef int sumTo(int x):
    cdef int y = 0
    cdef int i
    for i in range(x):
        y += i
    return y

Setup file: setup.py:

# distutils was removed in Python 3.12; setuptools provides the same setup()
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize('example_cy.pyx'))

Compile by running python setup.py build_ext --inplace.

C setup

example_C.h:

#ifndef EXAMPLE_C_
#define EXAMPLE_C_

int sumTo(int x);

#endif

example_C.c:

#include <stdio.h>
#include "example_C.h"

int sumTo(int x) {
    int y = 0;

    for(int i = 0; i < x; i++) {
        y += i;
    }

    return y;
}

Compile by running gcc -shared -o libcalci.so -fPIC example_C.c.

Testing scripts

To test the ctypes approach, I ran this script:

import example_py
from ctypes import *
import time

numRuns = 1000
x = 5000

# Run test on python script
tic = time.perf_counter()
for i in range(numRuns):
    example_py.sumTo(x)
py_runtime = time.perf_counter() - tic

# Run test on the C library (called through ctypes)
libCalc = CDLL("./libcalci.so")
tic = time.perf_counter()
for i in range(numRuns):
    libCalc.sumTo(x)
c_runtime = time.perf_counter() - tic

# Print results
print(py_runtime, c_runtime)
print('Ctypes is {}x faster'.format(py_runtime/c_runtime))

To test the cython approach, I ran this script:

import time
import example_cy
import example_py

numRuns = 1000
x = 5000

# Run test on python script
tic = time.perf_counter()
for i in range(numRuns):
    example_py.sumTo(x)
py_runtime = time.perf_counter() - tic

# Run test on the compiled cython module
tic = time.perf_counter()
for i in range(numRuns):
    example_cy.sumTo(x)
c_runtime = time.perf_counter() - tic

# Print results
print(py_runtime, c_runtime)
print('Cython is {}x faster'.format(py_runtime/c_runtime))
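
Both scripts repeat the same hand-written timing loop. As a rough sketch only (the names sum_to and time_calls are hypothetical and not part of the scripts above), the standard-library timeit module can do the same measurement and take the best of several repeats, which tends to reduce noise from other processes:

```python
import timeit


def sum_to(x):
    """Pure-Python reference: sum of the integers 0 .. x-1."""
    y = 0
    for i in range(x):
        y += i
    return y


def time_calls(func, x=5000, num_runs=1000, repeats=5):
    """Return the best total time in seconds for num_runs calls of func(x).

    The lambda adds a small, constant per-call overhead to every measurement.
    """
    timings = timeit.repeat(lambda: func(x), number=num_runs, repeat=repeats)
    return min(timings)


if __name__ == "__main__":
    baseline = time_calls(sum_to)
    print(f"pure Python: {baseline:.4f} s for 1000 calls")
```

Any of the three implementations could be passed as func, for example example_cy.sumTo or libCalc.sumTo, assuming those modules are built as described above.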

Any thoughts on why the ctypes approach is so much slower than the cython approach? Thank you for your time and wisdom!

Solution

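Thanks to Jérôme Richard! The ctypes approach was slower than the cython approach because I compiled the C code without any optimization flags. With no -O flag, gcc defaults to -O0 (no optimization), whereas the cython build typically inherits the optimization flags that Python itself was built with.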

Recompiling with the -O3 flag made the ctypes approach 214x faster than python, -O3 -mavx2 made it 315x faster, and -O3 -mavx2 -march=native made it 325x faster.
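
For anyone who wants to repeat the comparison, here is a minimal sketch of how the flag combinations could be built and timed from one script. It assumes gcc is on the PATH and an x86 machine (for -mavx2); the helper names build_command and time_build and the library file names are hypothetical, and the timings will differ from machine to machine.

```python
import ctypes
import shutil
import subprocess
import tempfile
import time
from pathlib import Path

# Same C function as example_C.c, inlined so the sketch is self-contained.
C_SOURCE = """
int sumTo(int x) {
    int y = 0;
    for (int i = 0; i < x; i++) {
        y += i;
    }
    return y;
}
"""

# Flag combinations from the post; the empty list is the original build.
FLAG_SETS = [[], ["-O3"], ["-O3", "-mavx2"], ["-O3", "-mavx2", "-march=native"]]


def build_command(src, out, flags):
    """Return the gcc command line for a shared library with extra flags."""
    return ["gcc", "-shared", "-fPIC", *flags, "-o", str(out), str(src)]


def time_build(flags, workdir, tag, x=5000, num_runs=1000):
    """Compile with the given flags, then time num_runs calls through ctypes."""
    src = Path(workdir) / "example_C.c"
    src.write_text(C_SOURCE)
    # One file name per flag set so each build is loaded as a separate library.
    out = Path(workdir) / f"libcalci_{tag}.so"
    subprocess.run(build_command(src, out, flags), check=True)
    lib = ctypes.CDLL(str(out))
    lib.sumTo.argtypes = [ctypes.c_int]
    lib.sumTo.restype = ctypes.c_int
    assert lib.sumTo(x) == x * (x - 1) // 2  # sanity check before timing
    tic = time.perf_counter()
    for _ in range(num_runs):
        lib.sumTo(x)
    return time.perf_counter() - tic


def main():
    if shutil.which("gcc") is None:
        print("gcc not found; skipping")
        return
    with tempfile.TemporaryDirectory() as workdir:
        for tag, flags in enumerate(FLAG_SETS):
            label = " ".join(flags) or "(no flags)"
            try:
                elapsed = time_build(flags, workdir, tag)
            except subprocess.CalledProcessError:
                print(f"{label}: compile failed on this machine")
                continue
            print(f"{label}: {elapsed:.5f} s")


if __name__ == "__main__":
    main()
```

Each line of output is the total time for 1000 calls, so dividing the pure-Python time by it gives the same kind of speedup figure quoted above.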