I am converting a Python script to Cython and optimizing it for speed. Right now I have two versions. On my desktop V2 is twice as fast as V1, but on my laptop V1 is twice as fast as V2, and I am unable to find out why there is such a big difference.
Both computers use:
- Ubuntu 16.04
- Python 2.7.12
- Cython 0.25.2
- Numpy 1.12.1
Desktop:
- Intel® Core™ i3-4370 CPU @ 3.80GHz × 4 64bit. 16GB RAM
Laptop:
- Intel® Core™ i5-3210 CPU @ 2.5GHz × 2 64bit. 8GB RAM
V1 - you can find the full code here. The only changes made are renaming go.py and preprocessing.py to go.pyx and preprocessing.pyx, and using import pyximport; pyximport.install() to compile them. You can run test.py. This version uses a 2D numpy array, board, to store data in go.pyx and a list comprehension in the get_board function in preprocessing.pyx to process data. During the test no function is called from go.py; only the numpy array board is used.
V2 - you can find the full code here. Quite a lot has changed; below is a list of everything affecting this test case. Be aware that all function and variable declarations have to be in go.pxd. You can run test.py using this command: python test.py build_ext --inplace
The 2D numpy array is replaced by:
    cdef char board[ 362 ]
and the function get_board_feature in go.pyx replaces the numpy list comprehension:
    cdef char get_board_feature( self, short location ):
        # return correct board feature value
        # 0 active player stone
        # 1 opponent stone
        # 2 empty location
        cdef char value = self.board[ location ]
        if value == EMPTY:
            return 2
        if value == self.player_current:
            return 0
        return 1
The get_board function in preprocessing.pyx is replaced with a function that loops over the array and calls get_board_feature in go.pyx for every location:
    @cython.boundscheck(False)
    @cython.wraparound(False)
    cdef int get_board(self, GameState state, np.ndarray[double, ndim=2] tensor, int offSet ):
        """A feature encoding WHITE BLACK and EMPTY on separate planes, but plane 0
        always refers to the current player and plane 1 to the opponent
        """
        cdef short location
        for location in range( 0, state.size * state.size ):
            tensor[ offSet + state.get_board_feature( location ), location ] = 1
        return offSet + 3
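For readers who want to check what the V2 loop computes without compiling anything, here is a minimal pure-Python sketch of the same encoding. The constants EMPTY, BLACK and WHITE are hypothetical values chosen for illustration (the real ones are defined in go.pyx and may differ), and nested lists stand in for the numpy tensor.

```python
# Hypothetical constants for illustration; the real values live in go.pyx.
EMPTY, BLACK, WHITE = 0, 1, 2


def board_feature(value, player_current):
    """Map a raw board value to a plane index.

    0 = active player stone, 1 = opponent stone, 2 = empty location.
    """
    if value == EMPTY:
        return 2
    if value == player_current:
        return 0
    return 1


def fill_board_planes(board, player_current, tensor, offset):
    """Set tensor[offset + plane][location] to 1 for every board location."""
    for location, value in enumerate(board):
        plane = board_feature(value, player_current)
        tensor[offset + plane][location] = 1
    return offset + 3


# Tiny 4-location "board" instead of the 19x19 one, to keep the example short.
board = [EMPTY, BLACK, WHITE, BLACK]
tensor = [[0] * len(board) for _ in range(3)]
next_offset = fill_board_planes(board, BLACK, tensor, 0)
```

Each location ends up with exactly one of the three planes set, which is the property the Cython loop relies on.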
Please let me know if I should include any other information or run certain tests.
cmp, diff test
The V2 go.c and preprocessing.c files are identical. V1 does not generate a .c file to compare.
Update: compared .so files
The V2 go.so files are different:
goD.so goL.so differ: byte 473, line 1
The preprocessing.so files are identical; not sure what to think of that.
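The same kind of comparison can be scripted, which is handy when several build artifacts have to be checked at once. The sketch below is only an illustration: the helper names sha256_of and first_difference are made up for this example, and first_difference reports a 1-based byte offset in the same way as the cmp output above.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def first_difference(path_a, path_b):
    """Return the 1-based offset of the first differing byte, or None if identical."""
    with open(path_a, "rb") as file_a, open(path_b, "rb") as file_b:
        data_a, data_b = file_a.read(), file_b.read()
    for index, (byte_a, byte_b) in enumerate(zip(data_a, data_b)):
        if byte_a != byte_b:
            return index + 1
    if len(data_a) != len(data_b):
        # One file is a prefix of the other; the difference starts past the shorter one.
        return min(len(data_a), len(data_b)) + 1
    return None
```

For example, first_difference("goD.so", "goL.so") would be expected to point at the same offset that cmp printed, assuming the files are the ones compared above.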
They are two different machines and behave differently. There's a reason why processor reviews use large benchmark suites. The desktop CPU may perform better on average, but execution times for two small but non-trivial pieces of code do not have to favor the desktop CPU, and differences in execution times definitely do not have to follow any linear relationship. Performance always depends on a huge number of factors. Possible explanations include, but are not limited to, differences in cache size and behavior between the two CPUs and the change in vector instruction sets from AVX to AVX2 between the Ivy Bridge laptop and the Haswell desktop.
Generally it's a good idea to concentrate on using good algorithms and on identifying and removing bottlenecks when optimizing performance. Staring at benchmarks from different machines will probably only cause a headache.
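As a rough illustration of how a bottleneck can be located on a single machine, here is a minimal sketch using the standard library's cProfile and pstats modules. It is written for Python 3; profile_call and slow_sum_of_squares are hypothetical names invented for this example, and the workload is a stand-in for whatever test.py actually exercises.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, text report of the top entries)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(10)  # ten most expensive entries
    return result, stream.getvalue()


def slow_sum_of_squares(n):
    """Stand-in workload; replace with the function you want to inspect."""
    return sum(i * i for i in range(n))


result, report = profile_call(slow_sum_of_squares, 10)
```

Printing report shows where the time goes, sorted by cumulative time. Note that functions declared with cdef in Cython do not appear in such a profile unless the module is compiled with Cython's profiling support enabled.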