Cython optimisation for ising model

This is my first post, I am trying to make an ising model simulation using a monte carlo algorithm. I order to optimise the code I am currently trying to use Cython. However for now the execution time of the python and cython programm are identical. If anyone has an idea to either improve the cython implementation or to make the program run faster without using cython please let me know.

Thanks

PS : here is the code

#cython: language_level=3
import numpy as np
import time
cimport numpy as np




cpdef gen_neighbors_2d(int a):
    cdef int n = a * a
    cdef np.ndarray neighbors = np.zeros((n, 4), dtype=np.int32)

    cdef int i, j, index
    for i in range(a):
        for j in range(a):
            index = i * a + j
            neighbors[index, 0] = i * a + (j - 1) % a # left
            neighbors[index, 1] = i * a + (j + 1) % a # right
            neighbors[index, 2] = ((i - 1) % a) * a + j # up
            neighbors[index, 3] = ((i + 1) % a) * a + j # down

    return neighbors

cpdef int MH_single_flip(np.ndarray[int, ndim=2] neighbors_list, double T, int iterations, int size):
    cdef np.ndarray[int, ndim=1] spins = np.random.choice([-1, 1], size=size)

    cdef int step, n, spin, delta_E
    for step in range(iterations):
        n = np.random.randint(0, size)
        spin = spins[n]

        delta_E = 2 * spin * np.add.reduce(spins[neighbors_list[n]])

        if delta_E < 0 or -T*np.log(np.random.random(1)[0]) > delta_E:
            spins[n] = -1*spin

    return spins




neighbors_list = gen_neighbors_2d(100)

T_list = np.linspace(1, 4, 1000)
size = neighbors_list.shape[0]
t1 = time.time()
for T in range(len(T_list)):

    blank_var = MH_single_flip(neighbors_list, T_list[T], 5000, size)
t2 = time.time()
print(t2-t1)

I at first tried to run the cython code without any variable definitions but it only made the code run 5 percent faster. Any of the upgrades I added after that increased performance by more than 1 percent. Any help is welcome !

Solution

Let's time your implementation as a benchmark reference:

%timeit MH_single_flip(neighbors_list, T, 5000, size)

>>> 15 ms ± 181 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Here are some hints in arbitrary order (you can find some of them also in DavidW's excellent answer linked in the comments)

Calling numpy functions like np.random.randint and np.random.random with scalar arguments inside a loop is pretty inefficient as function calls in Python have noticeable overhead. As numpy ufuncs call C code under the hood, most of the time is spent for checking the function arguments on the Python level rather than the calculation itself. You should always try to do as much (vectorized) calculations as possible with a single numpy function call. That way, you can benefit from the fast calculations under the hood and don't need to worry much about the Python overhead.
Similarly, np.add.reduce only computes the sum of four integers, so it's dominated by the Python overhead. Here, using a plain C function is reasonable and faster alternative.
You should prefer typed memoryviews to the old typed np.ndarrays. The syntax is cleaner and they are slightly faster.
You can trade safety for speed by disabling all index checks of the numpy arrays / memoryviews by the Cython directive boundschecks.

Here's some code snippet:

%%cython

import numpy as np
from cython import wraparound, boundscheck, cdivision, initializedcheck

@wraparound(False)
@boundscheck(False)
@initializedcheck(False)
@cdivision(True)
def gen_neighbors_2d(int a):
    cdef int n = a * a
    cdef int[:, ::1] neighbors = np.zeros((n, 4), dtype=np.int32)
    cdef int i, j, index
    for i in range(a):
        for j in range(a):
            index = i * a + j
            neighbors[index, 0] = i * a + (j - 1) % a # left
            neighbors[index, 1] = i * a + (j + 1) % a # right
            neighbors[index, 2] = ((i - 1) % a) * a + j # up
            neighbors[index, 3] = ((i + 1) % a) * a + j # down
    return np.asarray(neighbors)

@wraparound(False)
@boundscheck(False)
@initializedcheck(False)
cdef int calculate_delta_e(int spin, long[::1] spins, int[::1] neighbors_list):
    return 2 * spin * (spins[neighbors_list[0]] + spins[neighbors_list[1]] + spins[neighbors_list[2]] + spins[neighbors_list[3]])

@wraparound(False)
@boundscheck(False)
@initializedcheck(False)
def MH_single_flip(int[:, ::1] neighbors_list, double T, int iterations, int size):
    cdef long[::1] spins = np.random.choice([-1, 1], size=size)
    cdef int[::1] nns = np.random.randint(0, size, iterations, dtype=np.int32)
    cdef double[::1] log_rand_nums = np.log(np.random.random(iterations))
    cdef int i, delta_E, n, spin
    for i in range(iterations):
        n = nns[i]
        spin = spins[n]
        delta_E = calculate_delta_e(spin, spins, neighbors_list[n])
        if delta_E < 0 or -T*log_rand_nums[i] > delta_E:
            spins[n] = -1*spin
    return np.asarray(spins)

On my machine, this is approximately 75 times faster than your version:

%timeit MH_single_flip(neighbors_list, T, 5000, size)

>>> 213 µs ± 527 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)