I assume this is a GPU issue more than a C++ AMP issue, so I am tagging it broadly.
I have an implementation of a calculation that splits the work into a number of tiles, each of which does its work and then adds its result to the existing value in global memory. First, every thread in the tile writes its partial result to tile_static memory, where each thread has its own index to write into. Afterwards, the first thread in the tile sums all the partial results and adds the sum to a position in global memory.
Different tiles (that is, thread 0 of each tile) will sometimes want to write to the same location, so I added simple locking:
inline void lock(int *lockVariable) restrict(amp)
{
    while (atomic_exchange(lockVariable, 1) != 0);
}

inline void unlock(int *lockVariable) restrict(amp)
{
    *lockVariable = 0;
}
The lock variable I pass to lock and unlock is in a global array of integers with one integer per contended memory position that the tiles will write into.
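For reference, the locks are created and zero-initialized roughly like this (simplified; numWritePositions and view stand for values in my real code):
std::vector<int> lockInit(numWritePositions, 0);                                // 0 means unlocked
array<int, 1> locks(numWritePositions, lockInit.begin(), lockInit.end(), view); // one lock per contended position
// 'locks' is captured by reference in the kernel and indexed with the same j as 'acceleration'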
The actual write of the tile result, done by the first thread in the tile, looks like this:
//now the FIRST thread in the tile will sum all the pulls into one
if (idx.local[0] == 0)
{
    double_4 tileAcceleration = 0;
    for (int i = 0; i < idx.tile_dim0; i++)
    {
        tileAcceleration += threadAccelerations[i];
    }
    lock(&locks[j]);
    //now the FIRST thread in the tile will add this to the global result
    acceleration[j] += tileAcceleration;
    unlock(&locks[j]);
}
This works mostly OK, but not always. Some race condition must exist, because when there are too many tiles relative to the number of memory locations to write to (too much fighting over locks), it sometimes fails to add the tile result properly.
It seems that sometimes, though rarely, the lock/unlock setup does not ensure correct addition.
This can be "fixed" by moving the lock up in front of the summation, so that more time passes between obtaining the lock and thread 0 doing the actual write. I can also "fix" it by taking the lock when there are five elements left in the summation. Both are shown below.
First fix, which is quite slow (it blocks for too long):
if (idx.local[0] == 0)
{
    lock(&locks[j]); //get lock right away
    double_4 tileAcceleration = 0;
    for (int i = 0; i < idx.tile_dim0; i++)
    {
        tileAcceleration += threadAccelerations[i];
    }
    //now the FIRST thread in the tile will add this to the global result
    acceleration[j] += tileAcceleration;
    unlock(&locks[j]);
}
Second fix, which is a bit faster:
if (idx.local[0] == 0)
{
    double_4 tileAcceleration = 0;
    for (int i = 0; i < idx.tile_dim0; i++)
    {
        tileAcceleration += threadAccelerations[i];
        if (i == idx.tile_dim0 - 5) lock(&locks[j]); //take the lock when almost done
    }
    //now the FIRST thread in the tile will add this to the global result
    acceleration[j] += tileAcceleration;
    unlock(&locks[j]);
}
Seeing how those "fixes" work, it seems obvious that some memory writes are not being propagated system-wide quickly enough. One tile can lock a location, write to it and unlock. Another tile then gets the lock, does its add (but using the old, not-yet-updated data) and unlocks.
The lock is an int and the data is a double_4, so it seems the lock is released and made visible to other tiles quickly enough while the data is still in transit. Another tile can then see the lock as free even though the first tile's write has not been fully committed yet. The second tile therefore reads the stale data value from cache, adds to it, and writes...
Can someone please help me understand exactly why the data was not invalidated (in cache) when the first tile did its write, and help me find a proper solution for this issue?
In short, what you're doing here is not a good solution to the problem. Firstly, atomic operations in C++ AMP have the following limitations:
You should not mix atomic and normal (nonatomic) reads and writes. Normal reads might not see the results of atomic writes to the same memory location. Normal writes should not be mixed with atomic writes to the same memory location. If your program does not comply with these criteria, this will lead to an undefined result.
Atomic operations do not imply a memory fence of any sort. Atomic operations may be reordered. This differs from the behavior of interlocked operations in C++.
So for your lock function to work, the unlock function also needs to use an atomic write rather than a plain store.
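A minimal sketch of what that could look like, keeping the signature from your question (and assuming the same using namespace concurrency context as your lock function):
inline void unlock(int *lockVariable) restrict(amp)
{
    // release the lock atomically so this store is never mixed with the
    // atomic_exchange used in lock() on the same memory location
    atomic_exchange(lockVariable, 0);
}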
In general you should not try to lock in this way, as it is very inefficient. Your program can synchronize operations between threads in the same tile using the tile barrier primitives; operations across tiles are only guaranteed to synchronize when a kernel completes.
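For example, instead of thread 0 spinning on a lock, each tile can reduce its threads' results in tile_static memory with tile_barrier and then write one partial result to its own slot in global memory, so no two tiles ever touch the same location. The following is only a sketch under assumptions about your setup (TILE_SIZE, threadCount, the partials array and the dummy per-thread value are placeholders, not your actual code):
#include <amp.h>
#include <amp_short_vectors.h>
using namespace concurrency;
using namespace concurrency::graphics;

static const int TILE_SIZE = 256;   // assumed tile size; threadCount must be a multiple of it

// Each tile reduces its threads' results in tile_static memory and writes ONE
// partial sum to its own slot in 'partials' (one element per tile). No two
// tiles write the same location, so no locks are needed.
void AccumulatePerTile(accelerator_view& view, array<double_4, 1>& partials, int threadCount)
{
    parallel_for_each(view, extent<1>(threadCount).tile<TILE_SIZE>(),
        [&partials](tiled_index<TILE_SIZE> tidx) restrict(amp)
    {
        tile_static double_4 threadAccelerations[TILE_SIZE];

        // placeholder for the real per-thread work in your kernel
        threadAccelerations[tidx.local[0]] = double_4(1.0, 0.0, 0.0, 0.0);
        tidx.barrier.wait();    // all per-thread results are now visible inside the tile

        // tree reduction within the tile instead of a serial loop in thread 0
        for (int stride = TILE_SIZE / 2; stride > 0; stride /= 2)
        {
            if (tidx.local[0] < stride)
                threadAccelerations[tidx.local[0]] += threadAccelerations[tidx.local[0] + stride];
            tidx.barrier.wait();
        }

        if (tidx.local[0] == 0)
            partials[tidx.tile[0]] = threadAccelerations[0];   // one slot per tile, no contention
    });
}
The partials (one per tile) can then be summed in a second, much smaller pass, or on the CPU, in the same spirit as the reduction example below.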
It looks like what you're trying to do here is some sort of reduction/accumulation operation. Each thread generates a result and then all those results are combined to create a single (final) result.
Here's an example of a simple reduction.
#include <vector>
#include <algorithm>
#include <numeric>
#include <amp.h>
using namespace concurrency;
int Reduce(accelerator_view& view, const std::vector<int>& source)
{
    const int windowWidth = 8;
    int elementCount = static_cast<int>(source.size());

    // Using array as temporary memory.
    array<int, 1> a(elementCount, source.cbegin(), source.cend(), view);

    // Takes care of the sum of tail elements.
    int tailSum = 0;
    if ((elementCount % windowWidth) != 0 && elementCount > windowWidth)
        tailSum =
            std::accumulate(source.begin() + ((elementCount - 1) / windowWidth) * windowWidth,
                source.end(), 0);
    array_view<int, 1> avTailSum(1, &tailSum);

    // Each thread reduces windowWidth elements.
    int prevStride = elementCount;
    for (int stride = (elementCount / windowWidth); stride > 0; stride /= windowWidth)
    {
        parallel_for_each(view, extent<1>(stride), [=, &a] (index<1> idx) restrict(amp)
        {
            int sum = 0;
            for (int i = 0; i < windowWidth; i++)
                sum += a[idx + i * stride];
            a[idx] = sum;

            // Reduce the tail in cases where the number of elements is not divisible.
            // Note: execution of this section may negatively affect the performance.
            // In production code the problem size passed to the reduction should
            // be a power of the windowWidth.
            if ((idx[0] == (stride - 1)) && ((stride % windowWidth) != 0) && (stride > windowWidth))
            {
                for (int i = ((stride - 1) / windowWidth) * windowWidth; i < stride; i++)
                    avTailSum[0] += a[i];
            }
        });
        prevStride = stride;
    }

    // Perform any remaining reduction on the CPU.
    std::vector<int> partialResult(prevStride);
    copy(a.section(0, prevStride), partialResult.begin());
    avTailSum.synchronize();
    return std::accumulate(partialResult.begin(), partialResult.end(), tailSum);
}
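A hypothetical call site (not from the book), just to show how it might be used:
std::vector<int> data(1 << 16, 1);                  // 65536 ones
accelerator_view view = accelerator().default_view; // default accelerator
int total = Reduce(view, data);                     // expected result: 65536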
In general, if your parallel code is relying on locks or other explicit synchronization primitives then you should ask whether it's really the right approach. If you can explain a bit more about what you're trying to achieve then I can probably provide a more specific answer.
The text and examples above come from The C++ AMP Book.
BTW: your code refers to tileAcceleration, so if you're implementing an n-body model of some sort you can find a complete implementation in the C++ AMP Book Codeplex Project.