Tags: numpy, optimization, cython, baduk

Fastest Cython implementation depends on computer?


I am converting a Python script to Cython and optimizing it for speed. I currently have two versions: on my desktop V2 is twice as fast as V1, but on my laptop V1 is twice as fast as V2, and I cannot work out why the difference is so large. Both computers use:
- Ubuntu 16.04
- Python 2.7.12
- Cython 0.25.2
- Numpy 1.12.1
Desktop:
- Intel® Core™ i3-4370 CPU @ 3.80GHz × 4 64bit. 16GB RAM
Laptop:
- Intel® Core™ i5-3210 CPU @ 2.5GHz × 2 64bit. 8GB RAM

V1 - you can find the full code here. The only changes are renaming go.py and preprocessing.py to go.pyx and preprocessing.pyx, and using import pyximport; pyximport.install() to compile them. You can run test.py.
This version stores the board in a 2D NumPy array in go.pyx and uses a list comprehension in the get_board function in preprocessing.pyx to process the data. During the test no function from go.py is called; only the NumPy array board is used.

V2 - you can find the full code here. Quite a lot has changed; below is a list of everything that affects this test case. Be aware that all function and variable declarations have to be in go.pxd. You can run test.py using this command: python test.py build_ext --inplace
The 2D NumPy array is replaced by:

cdef char board[ 362 ]

and the function get_board_feature in go.pyx replaces the NumPy list comprehension:

cdef char get_board_feature( self, short location ):
    # return correct board feature value
    # 0 active player stone
    # 1 opponent stone
    # 2 empty location

    cdef char value = self.board[ location ]

    if value == EMPTY:
        return 2

    if value == self.player_current:
        return 0

    return 1

The get_board function in preprocessing.pyx is replaced with a function that loops over the array and calls get_board_feature in go.pyx for every location:

@cython.boundscheck(False)
@cython.wraparound(False)
cdef int get_board(self, GameState state, np.ndarray[double, ndim=2] tensor, int offSet ):
    """A feature encoding WHITE BLACK and EMPTY on separate planes, but plane 0
       always refers to the current player and plane 1 to the opponent
    """

    cdef short location

    for location in range( 0, state.size * state.size ):

        tensor[ offSet + state.get_board_feature( location ), location ] = 1

    return offSet + 3
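Before comparing timings, it can help to confirm that both versions fill the tensor identically. The following is a minimal pure-Python/NumPy sketch of the same feature encoding, not the code from either repository. The constants EMPTY, BLACK and WHITE and the helper names get_board_loop and get_board_vectorized are assumptions made for illustration; the real values are declared in go.pxd.

```python
import numpy as np

# Hypothetical stone encoding; the real constants are declared in go.pxd.
EMPTY, BLACK, WHITE = 0, 1, 2


def get_board_feature(board, player_current, location):
    """Return 0 for the active player's stone, 1 for the opponent's, 2 for empty."""
    value = board[location]
    if value == EMPTY:
        return 2
    if value == player_current:
        return 0
    return 1


def get_board_loop(board, player_current, tensor, offset):
    """Loop-based encoding, mirroring the structure of the V2 function."""
    for location in range(len(board)):
        plane = offset + get_board_feature(board, player_current, location)
        tensor[plane, location] = 1
    return offset + 3


def get_board_vectorized(board, player_current, tensor, offset):
    """Same encoding written with whole-array NumPy operations."""
    board = np.asarray(board)
    planes = np.where(board == EMPTY, 2,
                      np.where(board == player_current, 0, 1))
    # One (plane, location) pair per board point.
    tensor[offset + planes, np.arange(board.size)] = 1
    return offset + 3
```

Running both functions on the same board and comparing the results with np.array_equal gives a quick correctness check that is independent of the machine.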

Please let me know if I should include any other information or run certain tests.

cmp, diff test
The V2 go.c and preprocessing.c files are identical across the two machines. V1 does not generate a .c file to compare.

Update: compared .so files
The V2 go.so files are different:

goD.so goL.so differ: byte 473, line 1

The preprocessing.so files are identical; I am not sure what to make of that.
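By default cmp reports only the first differing byte, which says little about how much of the two binaries differs. As a rough sketch, a helper along these lines lists the differing offsets and their total count; the function name differing_offsets and its limit parameter are made up for this example.

```python
def differing_offsets(path_a, path_b, limit=20):
    """Compare two files byte by byte.

    Returns (first `limit` differing offsets, total number of differing
    bytes, size of file a, size of file b). Offsets are 1-based, like cmp.
    Only the common prefix is compared if the sizes differ.
    """
    with open(path_a, "rb") as file_a, open(path_b, "rb") as file_b:
        data_a = file_a.read()
        data_b = file_b.read()

    diffs = [index + 1
             for index, (x, y) in enumerate(zip(data_a, data_b))
             if x != y]
    return diffs[:limit], len(diffs), len(data_a), len(data_b)
```

Calling differing_offsets("goD.so", "goL.so") would show whether the difference is confined to a few bytes or spread through the file.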


Solution

  • They are two different machines and behave differently. There is a reason why processor reviews use large benchmark suites. The desktop CPU may well perform better on average, but the execution times of two small, non-trivial pieces of code do not have to favor the desktop CPU, and differences in execution time certainly do not have to follow a linear relationship. Performance always depends on a large number of factors. Possible explanations include, but are not limited to, differences in cache size and behavior between the two chips and the change in vector instruction sets from AVX to AVX2 between the Ivy Bridge laptop and the Haswell desktop.

    Generally it is a good idea to concentrate on using good algorithms and on identifying and removing bottlenecks when optimizing performance. Staring at benchmark differences between machines will probably only cause a headache.
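    When measuring on a single machine, repeating the measurement and keeping the best run reduces noise from other processes. A minimal sketch using the standard timeit module follows; best_time and the two candidate functions are hypothetical stand-ins for the V1 and V2 code paths.

```python
import timeit


def best_time(func, repeat=5, number=1000):
    """Return the best observed time per call of func, in seconds.

    The minimum over several repeats is used because slower runs usually
    reflect interference from the rest of the system, not the code itself.
    """
    timings = timeit.repeat(func, repeat=repeat, number=number)
    return min(timings) / number


# Hypothetical stand-ins for the two implementations under test.
def candidate_a():
    return [value * 2 for value in range(361)]


def candidate_b():
    result = []
    for value in range(361):
        result.append(value * 2)
    return result


if __name__ == "__main__":
    for name, func in (("candidate_a", candidate_a),
                       ("candidate_b", candidate_b)):
        print("%s: %.3e s per call" % (name, best_time(func)))
```

    For finding where the time goes inside one version, the standard cProfile module is the usual next step.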