I have been trying to learn Cython to speed up a personal project, but I noticed something odd.
Example:
I tried the rbf_network example from http://nealhughes.net/cython1/:
import pyximport; pyximport.install()
from src.test_cython import rbf_network  # File1: /src/test_cython.pyx
# from src.test import rbf_network       # File2: /src/test.py
import time
import cProfile
import numpy as np

def fun():
    D = 5
    N = 1000
    X = np.array([np.random.rand(N) for d in range(D)]).T
    beta = np.random.rand(N)
    theta = 10
    rbf_network(X, beta, theta)

# With cProfile
cProfile.run('fun()', sort='cumtime')

# Without cProfile
start = time.time()
fun()
print("Time without CProfile: ", time.time() - start)
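For comparison, the same function can also be benchmarked with timeit, which repeats the call and does not add profiler overhead. Below is a self-contained sketch (my own, not from the post) with the pure-Python rbf_network inlined and N reduced to 100 so it finishes quickly:

```python
import timeit
import numpy as np
from math import exp

def rbf_network(X, beta, theta):
    # Pure-Python version, same algorithm as File2 below
    N, D = X.shape
    Y = np.zeros(N)
    for i in range(N):
        for j in range(N):
            r = 0.0
            for d in range(D):
                r += (X[j, d] - X[i, d]) ** 2
            r = r ** 0.5
            Y[i] += beta[j] * exp(-(r * theta) ** 2)
    return Y

D, N = 5, 100  # smaller N than the post so this runs in well under a second
X = np.array([np.random.rand(N) for d in range(D)]).T
beta = np.random.rand(N)
theta = 10

# timeit.repeat gives several independent measurements; take the best one
best = min(timeit.repeat(lambda: rbf_network(X, beta, theta), number=1, repeat=3))
print(f"best of 3: {best:.4f} s")
```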
Both File1 and File2 contain:
from math import exp
import numpy as np
def rbf_network(X, beta, theta):
    N = X.shape[0]
    D = X.shape[1]
    Y = np.zeros(N)
    for i in range(N):
        for j in range(N):
            r = 0
            for d in range(D):
                r += (X[j, d] - X[i, d]) ** 2
            r = r**0.5
            Y[i] += beta[j] * exp(-(r * theta)**2)
    return Y
Output on File1 (cythonized):
13 function calls in 3.920 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 3.920 3.920 {built-in method builtins.exec}
1 0.000 0.000 3.920 3.920 <string>:1(<module>)
1 0.000 0.000 3.920 3.920 run.py:138(fun)
1 3.920 3.920 3.920 3.920 {src.test_cython.rbf_network}
1 0.000 0.000 0.000 0.000 run.py:141(<listcomp>)
6 0.000 0.000 0.000 0.000 {method 'rand' of 'mtrand.RandomState' objects}
1 0.000 0.000 0.000 0.000 {built-in method numpy.core.multiarray.array}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
Time without CProfile: 3.899562120437622
Output on File2 (non-Cython):
1000014 function calls in 13.193 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 13.193 13.193 {built-in method builtins.exec}
1 0.000 0.000 13.193 13.193 <string>:1(<module>)
1 0.000 0.000 13.193 13.193 run.py:138(fun)
1 7.948 7.948 13.193 13.193 test.py:4(rbf_network)
1000000 5.245 0.000 5.245 0.000 {built-in method math.exp}
1 0.000 0.000 0.000 0.000 run.py:141(<listcomp>)
6 0.000 0.000 0.000 0.000 {method 'rand' of 'mtrand.RandomState' objects}
1 0.000 0.000 0.000 0.000 {built-in method numpy.core.multiarray.array}
1 0.000 0.000 0.000 0.000 {built-in method numpy.core.multiarray.zeros}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
Time without CProfile: 4.139716863632202
In short, the Cythonized code seems to improve from 13.19 s to 3.92 s when measured with cProfile, but when measured with time.time() the improvement is only from 4.14 s to 3.90 s.
Cython does provide some speedup even when used naively, but measuring that speedup with a time profiler seems to inflate the result, as if the profiler's overhead hits the pure-Python code much harder than the Cython code. Is this actually what is happening, or am I doing something wrong?
Edit: I am also not sure why {built-in method math.exp} does not appear in the cProfile output for the Cythonized code.
The Python profile module docs address this directly:

    Note: The profiler modules are designed to provide an execution profile for a given program, not for benchmarking purposes (for that, there is timeit for reasonably accurate results). This particularly applies to benchmarking Python code against C code: the profilers introduce overhead for Python code, but not for C-level functions, and so the C code would seem faster than any Python one.

This also answers your Edit: the compiled Cython function runs outside the CPython interpreter loop, so cProfile never observes the calls made inside it (including the calls to math.exp); the whole function shows up as a single opaque entry. If you want Cython code to show up in profiles, Cython provides the profile=True compiler directive to emit the necessary hooks.
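The effect is easy to reproduce. The sketch below (my own illustration, not from the question) times the same pure-Python workload with and without cProfile attached; the exact slowdown depends on the machine, but the profiled run is reliably slower because cProfile charges an event to every one of the million C-level math.exp calls:

```python
import cProfile
import time
from math import exp

def work():
    # ~1,000,000 calls into math.exp, like the inner loop of rbf_network
    s = 0.0
    for i in range(1_000_000):
        s += exp(-i * 1e-6)
    return s

# Wall-clock time without the profiler
start = time.perf_counter()
work()
plain = time.perf_counter() - start

# Wall-clock time with cProfile attached
profiler = cProfile.Profile()
start = time.perf_counter()
profiler.runcall(work)
profiled = time.perf_counter() - start

print(f"without profiler: {plain:.3f} s")
print(f"with profiler:    {profiled:.3f} s")
```

A compiled Cython function, by contrast, appears to the profiler as one opaque call, so it pays this per-event tax only once, which is exactly why the pure-Python version looks so much worse under cProfile than under plain wall-clock timing.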