c++c parallel-processing openmp conways-game-of-life

OpenMP with Game of Life visualization using SFML

Hello I'm trying to compare speeds between serial and parallel version of 'Game of Life'. I used SFML library to visualize game of life like this. SFML window Serial logic is simple like below.

for (int i = 0; i < height; i++) {
            for (int j = 0; j < width; j++) {
                int neighbor = 0;

                // check 8 cells around.
                // 1 2 3  -1
                // 4   5  0
                // 6 7 8  +1

                // (1)
                if (gamefieldSerial.isAvailableCell(UP(i), LEFT(j))) {
                    if(gamefieldSerial[UP(i)][LEFT(j)] == LIVE) neighbor++;
                }
                // (2)
                if (gamefieldSerial.isAvailableCell(UP(i), j)) {
                    if (gamefieldSerial[UP(i)][j] == LIVE)      neighbor++;
                }
                // (3)
                if (gamefieldSerial.isAvailableCell(UP(i), RIGHT(j))) {
                    if (gamefieldSerial[UP(i)][RIGHT(j)] == LIVE)   neighbor++;
                }
                // (4)
                if (gamefieldSerial.isAvailableCell(i, LEFT(j))) {
                    if (gamefieldSerial[i][LEFT(j)] == LIVE)        neighbor++;
                }
                // (5)
                if (gamefieldSerial.isAvailableCell(i, RIGHT(j))) {
                    if (gamefieldSerial[i][RIGHT(j)] == LIVE)       neighbor++;
                }
                // (6)
                if (gamefieldSerial.isAvailableCell(DOWN(i), LEFT(j))) {
                    if (gamefieldSerial[DOWN(i)][LEFT(j)] == LIVE)  neighbor++;
                }
                // (7)
                if (gamefieldSerial.isAvailableCell(DOWN(i), j)) {
                    if (gamefieldSerial[DOWN(i)][j] == LIVE)        neighbor++;
                }
                // (8)
                if (gamefieldSerial.isAvailableCell(DOWN(i), RIGHT(j))) {
                    if (gamefieldSerial[DOWN(i)][RIGHT(j)] == LIVE) neighbor++;
                }

                // -- Rule of Game of Life
                // Cell borns when exactly 3 neighbor is LIVE
                // Cell remains alive when 2 or 3 neighbor is LIVE
                // Cell with more than 3 neighbor dies with overpopulation
                // Cell with less than 2 neighbor dies with underpopulation
                if (gamefieldSerial[i][j] == DEAD) {
                    if (neighbor == 3) {
                        gamefieldSerial[i][j] = LIVE;
                    }
                }
                else if (gamefieldSerial[i][j] == LIVE) {
                    if (neighbor < 2 || neighbor > 3) {
                        gamefieldSerial[i][j] = DEAD;
                    }
                }
            }

It took 3940ms on 768*256 cells with 100 generations. But in parallel version I implemented like below,

#pragma omp parallel for num_threads(4)
        for (int t = 0; t < width * height; t++) {
            int i = t / width;
            int j = t % width;
            int neighbor = 0;

            // check 8 cells around.
            // 1 2 3  -1
            // 4   5  0
            // 6 7 8  +1

            // (1)
            if (gamefieldParallel.isAvailableCell(UP(i), LEFT(j))) {
                if (gamefieldParallel[UP(i)][LEFT(j)] == LIVE) neighbor++;
            }
            // (2)
            if (gamefieldParallel.isAvailableCell(UP(i), j)) {
                if (gamefieldParallel[UP(i)][j] == LIVE)      neighbor++;
            }
            // (3)
            if (gamefieldParallel.isAvailableCell(UP(i), RIGHT(j))) {
                if (gamefieldParallel[UP(i)][RIGHT(j)] == LIVE)   neighbor++;
            }
            // (4)
            if (gamefieldParallel.isAvailableCell(i, LEFT(j))) {
                if (gamefieldParallel[i][LEFT(j)] == LIVE)        neighbor++;
            }
            // (5)
            if (gamefieldParallel.isAvailableCell(i, RIGHT(j))) {
                if (gamefieldParallel[i][RIGHT(j)] == LIVE)       neighbor++;
            }
            // (6)
            if (gamefieldParallel.isAvailableCell(DOWN(i), LEFT(j))) {
                if (gamefieldParallel[DOWN(i)][LEFT(j)] == LIVE)  neighbor++;
            }
            // (7)
            if (gamefieldParallel.isAvailableCell(DOWN(i), j)) {
                if (gamefieldParallel[DOWN(i)][j] == LIVE)        neighbor++;
            }
            // (8)
            if (gamefieldParallel.isAvailableCell(DOWN(i), RIGHT(j))) {
                if (gamefieldParallel[DOWN(i)][RIGHT(j)] == LIVE) neighbor++;
            }

            // -- Rule of Game of Life
            // Cell borns when exactly 3 neighbor is LIVE
            // Cell remains alive when 2 or 3 neighbor is LIVE
            // Cell with more than 3 neighbor dies with overpopulation
            // Cell with less than 2 neighbor dies with underpopulation
            if (gamefieldParallel[i][j] == DEAD) {
                if (neighbor == 3) {
                    gamefieldParallel[i][j] = LIVE;
                }
            }
            else if (gamefieldParallel[i][j] == LIVE) {
                if (neighbor < 2 || neighbor > 3) {
                    gamefieldParallel[i][j] = DEAD;
                }
            }
        }

It took 5746ms on same environment. I thought applying openMP's 'for' directive in for-loop enhances the performance, but it doesn't. Should I have to approach in another way?

============= Both gamefieldParallel and gamefieldSerial is instance of GameField class which has dynamically allocated int** field variable for cells. I'm using operator overloading to access it like two dimensional array.(Sorry for bad english!)

Solution

In both your serial version and your parallel version you are updating the game field as you iterate through it. This is a bug with both versions. Say you calculate the new state for gameField[0][0] and then update it. Now you move onto gameField[0][1] --- as part of calculating its new state, you look left to gameField[0][0] which already contains that cell's newly-updated state, but the Game of Life's rules have to be applied to the first cell's previous state.

In other words, you should have a read-only (const) oldGameField and then fill new states into newGameField. Once you have calculated all cells, you can use the new field as the old field for the next game iteration.

Fixing that bug is actually an important part of sorting out your performance problem.

Multi-threading

Instead of thinking of 4 processors doing these updates, imagine 4 people doing it with pencil and paper. Because you will now treat oldGameField as read-only, it is safe to photocopy the old page and give a copy to each person. Each person knows that no-one else is going to change the old page, so they don't have to worry about their copy ever becoming out of date.

But you sill only have one page for the newGameField. In your serial approach, there is only one person, who has the page and the pencil exclusively to themselves. Now you have four people trying to draw on the same page at the same time. They waste more time passing the pencil and page amongst themselves than the time they spend on doing the calculations! It literally takes four people longer to do the job than one person could have done it alone.

This is not meant to be an exact representation of what's going on inside hardware, but when we consider any locking and/or atomics OpenMP might be using, and things like the memory caches in your processor's cores --- it's pretty close.

So how do we fix it?

You and your three friends could decide that each of you would update one quarter of the field. Maybe you take the whole top quarter of the board. The next person takes the second quarter, and so on. And rather than fighting over one sheet of paper to draw the new game field, you each have your own piece of paper to draw just your quarter of the new states.

Once you are all done, you quickly paste your four pieces of paper together to make the new game field page.

The point here is to make sure that each thread is reading from memory that no-one is changing, and that each thread writes only to memory that no other thread is accessing. This means that cores do not block each other with memory writes, and do not have to flush their caches when they see other cores writing to shared memory.

Making sure that each thread's new game field memory isn't close enough to cause interference to the memory used by some other thread is tricky. You generally need to know some information on the size of the cache lines in your cores, whether or not your platform use something called "NUMA", etc etc.

I don't know OpenMP - but perhaps it has some built-in support to help with this.

In summary

Separate the old state from the new state
Your threads can all happily share the old state (because no-one is changing it during an iteration)
Break the work into chunks - one for each thread
Give each thread its own piece of memory to store its results.
Once all threads are finished, have you main thread bring all their results together into one combined new game field.