I have been trying to learn Cython to speed up a personal project, but I noticed something odd.
Example:
I tried the rbf_network example from http://nealhughes.net/cython1/:
import pyximport; pyximport.install()
from src.test_cython import rbf_network  # File1: /src/test_cython.pyx
# from src.test import rbf_network       # File2: /src/test.py
import time
import cProfile
import numpy as np

def fun():
    D = 5
    N = 1000
    X = np.array([np.random.rand(N) for d in range(D)]).T
    beta = np.random.rand(N)
    theta = 10
    rbf_network(X, beta, theta)

# With cProfile
cProfile.run('fun()', sort='cumtime')

# Without cProfile
start = time.time()
fun()
print("Time without CProfile: ", time.time() - start)
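For comparison, the same function can also be benchmarked with timeit, which repeats the call and does not add profiler overhead. Below is a self-contained sketch (my own, not from the post) with the pure-Python rbf_network inlined and N reduced to 100 so it finishes quickly:

```python
import timeit
import numpy as np
from math import exp

def rbf_network(X, beta, theta):
    # Pure-Python version, same algorithm as File2 below
    N, D = X.shape
    Y = np.zeros(N)
    for i in range(N):
        for j in range(N):
            r = 0.0
            for d in range(D):
                r += (X[j, d] - X[i, d]) ** 2
            r = r ** 0.5
            Y[i] += beta[j] * exp(-(r * theta) ** 2)
    return Y

D, N = 5, 100  # smaller N than the post so this runs in well under a second
X = np.array([np.random.rand(N) for d in range(D)]).T
beta = np.random.rand(N)
theta = 10

# timeit.repeat gives several independent measurements; take the best one
best = min(timeit.repeat(lambda: rbf_network(X, beta, theta), number=1, repeat=3))
print(f"best of 3: {best:.4f} s")
```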
Both File1 and File2 contain:
from math import exp
import numpy as np
def rbf_network(X, beta, theta):
    N = X.shape[0]
    D = X.shape[1]
    Y = np.zeros(N)
    for i in range(N):
        for j in range(N):
            r = 0
            for d in range(D):
                r += (X[j, d] - X[i, d]) ** 2
            r = r**0.5
            Y[i] += beta[j] * exp(-(r * theta)**2)
    return Y
Output on File1 (cythonized):
13 function calls in 3.920 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 3.920 3.920 {built-in method builtins.exec}
1 0.000 0.000 3.920 3.920 <string>:1(<module>)
1 0.000 0.000 3.920 3.920 run.py:138(fun)
1 3.920 3.920 3.920 3.920 {src.test_cython.rbf_network}
1 0.000 0.000 0.000 0.000 run.py:141(<listcomp>)
6 0.000 0.000 0.000 0.000 {method 'rand' of 'mtrand.RandomState' objects}
1 0.000 0.000 0.000 0.000 {built-in method numpy.core.multiarray.array}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
Time without CProfile: 3.899562120437622
Output on File2 (non-Cython):
1000014 function calls in 13.193 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 13.193 13.193 {built-in method builtins.exec}
1 0.000 0.000 13.193 13.193 <string>:1(<module>)
1 0.000 0.000 13.193 13.193 run.py:138(fun)
1 7.948 7.948 13.193 13.193 test.py:4(rbf_network)
1000000 5.245 0.000 5.245 0.000 {built-in method math.exp}
1 0.000 0.000 0.000 0.000 run.py:141(<listcomp>)
6 0.000 0.000 0.000 0.000 {method 'rand' of 'mtrand.RandomState' objects}
1 0.000 0.000 0.000 0.000 {built-in method numpy.core.multiarray.array}
1 0.000 0.000 0.000 0.000 {built-in method numpy.core.multiarray.zeros}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
Time without CProfile: 4.139716863632202
In short, the Cythonized code seems to improve from 13.19 s to 3.92 s when measured with cProfile, but when measured with time.time() the improvement is only from 4.14 s to 3.90 s.
Cython does provide some speedup even when used naively, but measuring that speedup with a time profiler seems to inflate the result, as if the profiler's overhead hits the pure-Python code much harder than the Cython code. Is this actually what is happening, or am I doing something wrong?
Edit: I am also not sure why {built-in method math.exp} does not appear in the cProfile output for the Cythonized code.
The Python profile module docs address this directly:

    Note: The profiler modules are designed to provide an execution profile for a given program, not for benchmarking purposes (for that, there is timeit for reasonably accurate results). This particularly applies to benchmarking Python code against C code: the profilers introduce overhead for Python code, but not for C-level functions, and so the C code would seem faster than any Python one.

This also answers your Edit: the compiled Cython function runs outside the CPython interpreter loop, so cProfile never observes the calls made inside it (including the calls to math.exp); the whole function shows up as a single opaque entry. If you want Cython code to show up in profiles, Cython provides the profile=True compiler directive to emit the necessary hooks.
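The effect is easy to reproduce. The sketch below (my own illustration, not from the question) times the same pure-Python workload with and without cProfile attached; the exact slowdown depends on the machine, but the profiled run is reliably slower because cProfile charges an event to every one of the million C-level math.exp calls:

```python
import cProfile
import time
from math import exp

def work():
    # ~1,000,000 calls into math.exp, like the inner loop of rbf_network
    s = 0.0
    for i in range(1_000_000):
        s += exp(-i * 1e-6)
    return s

# Wall-clock time without the profiler
start = time.perf_counter()
work()
plain = time.perf_counter() - start

# Wall-clock time with cProfile attached
profiler = cProfile.Profile()
start = time.perf_counter()
profiler.runcall(work)
profiled = time.perf_counter() - start

print(f"without profiler: {plain:.3f} s")
print(f"with profiler:    {profiled:.3f} s")
```

A compiled Cython function, by contrast, appears to the profiler as one opaque call, so it pays this per-event tax only once, which is exactly why the pure-Python version looks so much worse under cProfile than under plain wall-clock timing.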