Change values of multiple cells in CUDA

It has to be a simple one, though I can't find an answer. I'm writing a program which has to calculate states of cellular automatons and in order to get a feeling how does CUDA works I tried to write a very simple program first. It takes a matrix, and every thread has to increment a value in its cell and in the cells which are above and below of this cell. So, if i give it the following matrix:

[0 0 0 0 0 0 0]
[0 0 0 0 0 0 0]
[0 0 0 0 0 0 0]
[0 0 0 0 0 0 0]
[0 0 0 0 0 0 0]
[0 0 0 0 0 0 0]
[0 0 0 0 0 0 0]

I expect to get the following result:

[2 2 2 2 2 2 2]
[3 3 3 3 3 3 3]
[3 3 3 3 3 3 3]
[3 3 3 3 3 3 3]
[3 3 3 3 3 3 3]
[3 3 3 3 3 3 3]
[2 2 2 2 2 2 2]

The first row has values of 2, because it doesn't have a row above which could increment values of first row one more time. And in a similar manner the last row has values of 2.
But I'm getting a matrix which looks like this:

[2 2 2 2 2 2 2]
[3 3 3 3 3 3 3]
[3 3 3 3 3 3 3]
[3 3 3 3 2 2 2]
[2 2 2 2 2 2 2]
[2 2 2 2 3 3 3]
[2 2 2 2 2 2 2]

And I can't understand why there are values of 2 in the 4th, 5th and ath 6th row - there have to be 3, not 2.
Here goes my code:

import numpy
import pycuda.autoinit
import pycuda.driver as cuda

from pycuda.compiler import SourceModule

w = 7

mod = SourceModule("""
        __global__ void diffusion(  int* result, int width, int height) {

            int xIndex = blockDim.x * blockIdx.x + threadIdx.x;
            int yIndex = blockDim.y * blockIdx.y + threadIdx.y;

            int flatIndex = xIndex + width * yIndex;
            int topIndex = xIndex + width * (yIndex - 1);
            int bottomIndex = xIndex + width * (yIndex + 1);

            int inc = 1;

            result[flatIndex] += inc;

            result[bottomIndex] += inc;

            result[topIndex] += inc;
        }

        """)

diff_func   = mod.get_function("diffusion")


def diffusion(res):

    height, width = numpy.int32(len(res)), numpy.int32(len(res[0]))

    diff_func(
        cuda.InOut(res),
        width,
        height,
        block=(w,w,1)
        )

def run(res, step):

    diffusion(res)
    print res

res   = numpy.array([[0 \
                        for _ in xrange(0, w)]\
                        for _ in xrange(0, w)], dtype='int32')

run(res, 0)

One more interesting thing: if I comment one of the following lines:

result[bottomIndex] += inc;
result[topIndex] += inc;

Everything works as expected and there aren't any unexpected values. It looks like in some cases CUDA can't work with values of three adjacent cells in one thread.

Solution

You have what is know as a memory race: multiple independent threads are attempting to update the same value in memory at the same time. The CUDA memory model doesn't define what happens when two threads try to update the same memory location simultaneously.

The solution is either to use atomic memory operations (see the CUDA programming guide for more information), or a different approach for updating adjacent cells (for example, colour the grid and update like coloured cells on separate passes through the grid).