Tags: graphics, hlsl, compute-shader, directx-12

How to calculate average values grouped by a mask value/texture on the GPU?


Background (if helpful)

I'm writing a GPU-based real-time preview lightmapper targeting NVIDIA RTX cards on Windows.

For any scene ready to be baked, I write:

  • lighting data in screen space to a 1920x1080 RWTexture2D<float3> TexColor, which can be considered the result at effectively infinite lightmap resolution for each geometry surface.

  • a MaskedID, combine(objectId, uint2(lightmapUV.x, lightmapUV.y)), which uniquely identifies a lightmap texel, to a 1920x1080 RWTexture2D<uint> TexMasked; this mask is used to group the pixels of TexColor when computing the result at a given lightmap resolution.

Finally, I apply the result to the scene to provide the lightmap preview.

What I want to do is:

RWTexture2D<float3> TexColor;  // RGBA32
Texture2D<uint>     TexMasked; // R32

void get_masked_mean(uint2 pixelPos)
{
    // ? How to do this on the GPU?
    // TexColor[pixelPos] = average of TexColor[pixel] over all
    // pixels satisfying (TexMasked[pixel] == TexMasked[pixelPos])
}

I tried reading TexColor and TexMasked back to the CPU, but it is very slow:


std::unordered_map<uint, float4> resultMap;
std::vector<std::vector<float3>> colorData  = TexColor.readBackToCPU();  // very slow
std::vector<std::vector<uint>>   maskedData = TexMasked.readBackToCPU();

for (int r = 0; r < rowCount; ++r)     // 1920
{
    for (int c = 0; c < colCount; ++c) // 1080
    {
        uint key      = maskedData[r][c];
        uint oldCount = resultMap[key].w;
        uint newCount = oldCount + 1;

        // incremental mean update; very slow
        resultMap[key] = float4((colorData[r][c] + oldCount * resultMap[key].xyz) / newCount, newCount);
    }
}

for (int r = 0; r < rowCount; ++r)
{
    for (int c = 0; c < colCount; ++c)
    {
        uint key = maskedData[r][c];
        colorData[r][c] = resultMap[key].xyz; // very slow
    }
}

TexColor.uploadToGPU(colorData);

Reading the data back from GPU to CPU, plus roughly 4,000,000 hash-map lookups (two passes over about 2 million pixels), is very, very slow. Is there more efficient CPU code for this? Or a GPU-based approach (compute shader)?


Solution

  • Well, there are many ways to optimize this code.

    If you want to stick with CPU-based computation (which is much easier to do, but will also always be significantly slower or come with other drawbacks), the first thing I would do is find a faster hash map implementation. You're doing about 4,000,000 hash table lookups, so even small improvements in hash table access speed can make a measurable difference. It won't be a huge gain, but it's fairly low-hanging fruit.
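    For illustration, here is a minimal sketch of the first loop using robin_hood::unordered_map (an open-addressing, std::unordered_map-compatible map from the martinus/robin-hood-hashing library; any similar map works). The reserve size and the Float3/Accum helper types are assumptions. Accumulating a sum plus a count, instead of a running mean, avoids the per-pixel division and the triple lookup of the original code:

    #include <cstdint>
    #include <vector>
    #include "robin_hood.h" // assumed: martinus/robin-hood-hashing, std-like API

    struct Float3 { float x = 0, y = 0, z = 0; };
    struct Accum  { Float3 sum; std::uint32_t count = 0; }; // sum + count; divide once at the end

    robin_hood::unordered_map<std::uint32_t, Accum> resultMap;

    void accumulate(const std::vector<std::vector<Float3>>& colorData,
                    const std::vector<std::vector<std::uint32_t>>& maskedData,
                    int rowCount, int colCount)
    {
        resultMap.reserve(1 << 16); // assumed upper bound on unique keys; avoids rehashing
        for (int r = 0; r < rowCount; ++r)
        {
            for (int c = 0; c < colCount; ++c)
            {
                Accum& a = resultMap[maskedData[r][c]]; // one lookup per pixel instead of three
                a.sum.x += colorData[r][c].x;
                a.sum.y += colorData[r][c].y;
                a.sum.z += colorData[r][c].z;
                ++a.count;
            }
        }
    }

    The second loop then writes sum / count back into each pixel, again with a single lookup per pixel.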

    If you do this calculation many times (e.g. when rendering every frame of an interactive application), you could also pipeline the work: instead of reading texture memory back right away, you always read back the textures from a few frames ago, and the GPU likewise consumes an uploaded result from a few frames ago. This increases memory cost considerably (you have to keep multiple frames in flight at all times), and the GPU only ever sees a result of the computation that is a couple of frames old, but with a sufficient number of frames in flight it completely hides the cost of reading the textures on the CPU and uploading the result back to the GPU. Obviously, if this is a one-time computation, or if a delayed result on the GPU is not acceptable, this isn't an option.
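    As a purely conceptual sketch (GpuTexture, beginReadBack, computeMaskedMeans, and uploadResultToGpu are hypothetical placeholders for your engine's D3D12 readback-heap and fence machinery, not a real API), pipelining with N frames in flight could look like this:

    #include <cstdint>

    struct GpuTexture;   // hypothetical texture handle
    struct ReadbackSlot; // defined below
    void beginReadBack(GpuTexture& color, GpuTexture& masked, ReadbackSlot& dst); // hypothetical async copy
    void computeMaskedMeans(ReadbackSlot& slot); // the CPU loops from the question
    void uploadResultToGpu(ReadbackSlot& slot);  // hypothetical upload
    extern GpuTexture TexColor, TexMasked;

    constexpr int kFramesInFlight = 3; // more slots hide more latency, cost more memory

    struct ReadbackSlot
    {
        bool pending = false;
        // ... CPU-visible colorData / maskedData storage ...
    };

    ReadbackSlot slots[kFramesInFlight];

    void onFrame(std::uint64_t frameIndex)
    {
        ReadbackSlot& slot = slots[frameIndex % kFramesInFlight];

        if (slot.pending) // the copy requested kFramesInFlight frames ago is done by now
        {
            computeMaskedMeans(slot); // no stall: the data is already in CPU memory
            uploadResultToGpu(slot);  // the GPU sees a result that is a few frames old
            slot.pending = false;
        }

        beginReadBack(TexColor, TexMasked, slot); // kick off this frame's copy; don't wait
        slot.pending = true;
    }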

    You could then further optimize the CPU code by using multiple threads, though that work would probably not be worth the effort; a sketch of the idea follows below.
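    A sketch of the multi-threaded variant, reusing the Accum type and resultMap from the earlier sketch: each thread accumulates a strided slice of rows into its own private map, and the per-thread maps are merged at the end, so the hot loop needs no locking. The thread count and row split are illustrative:

    #include <algorithm>
    #include <thread>

    void accumulateParallel(const std::vector<std::vector<Float3>>& colorData,
                            const std::vector<std::vector<std::uint32_t>>& maskedData,
                            int rowCount, int colCount)
    {
        const unsigned numThreads = std::max(1u, std::thread::hardware_concurrency());
        std::vector<robin_hood::unordered_map<std::uint32_t, Accum>> partial(numThreads);
        std::vector<std::thread> workers;
        for (unsigned t = 0; t < numThreads; ++t)
            workers.emplace_back([&, t] {
                for (int r = static_cast<int>(t); r < rowCount; r += static_cast<int>(numThreads))
                    for (int c = 0; c < colCount; ++c)
                    {
                        Accum& a = partial[t][maskedData[r][c]]; // private map: no contention
                        a.sum.x += colorData[r][c].x;
                        a.sum.y += colorData[r][c].y;
                        a.sum.z += colorData[r][c].z;
                        ++a.count;
                    }
            });
        for (auto& w : workers) w.join();

        for (auto& m : partial) // merge per-thread maps into the shared result
            for (auto& kv : m)
            {
                Accum& dst = resultMap[kv.first];
                dst.sum.x += kv.second.sum.x;
                dst.sum.y += kv.second.sum.y;
                dst.sum.z += kv.second.sum.z;
                dst.count += kv.second.count;
            }
    }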

    Doing this computation on the GPU is possible, but much more involved. Since GPUs are highly parallel, implementing a hash map on the GPU is very challenging. Instead, I would use multiple shaders to re-organize the data until you don't need a hash table anymore. The basic idea is to sort your data by key (storing the original position of each entry alongside the sorted data), then to use a prefix sum to 'flatten' all successive entries with the same key into a single number, which you can then use to schedule work for each unique key. The exact implementation depends on a few details - for example, you might use indirect execution to schedule one thread (or one thread group, etc.) per unique key, or you might rewrite your math so that it can be done atomically and then run one thread per pixel (note that atomic operations on floats are only supported through some vendor-specific extensions and are not available on all hardware). At the end, using the stored original position of each entry from before the sort and the position of the entry in the sorted data, you can write the result back into each pixel.

    This whole process requires multiple compute shaders and additional buffers, and writing these shaders and the surrounding pipeline so that they're fast and efficient is a topic of its own. But if you put the work and time into it, you can make this much faster than any CPU-side approach (taking probably less than 1 ms on good hardware, I would guess). A sketch of the atomic per-pixel variant is below.
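    To make the atomic variant concrete, here is a hedged HLSL sketch. It assumes an earlier pass (e.g. the sort + prefix-sum passes described above) has already remapped every MaskedID to a dense index in [0, numKeys), written to TexDenseKey, and colors are accumulated in 32-bit fixed point so only plain uint InterlockedAdd is needed, with no vendor-specific float atomics. Names, registers, and the fixed-point scale are assumptions:

    Texture2D<uint>           TexDenseKey : register(t0); // assumed dense remap of TexMasked
    RWTexture2D<float3>       TexColor    : register(u0);
    RWStructuredBuffer<uint4> Sums        : register(u1); // xyz: fixed-point color sums, w: count; zero-cleared before dispatch

    static const float kFixedScale = 4096.0; // precision/overflow trade-off; large regions or HDR values may need a two-uint 64-bit accumulator

    [numthreads(8, 8, 1)]
    void Accumulate(uint2 pixelPos : SV_DispatchThreadID)
    {
        uint   key = TexDenseKey[pixelPos];
        float3 c   = TexColor[pixelPos];
        InterlockedAdd(Sums[key].x, (uint)(c.x * kFixedScale));
        InterlockedAdd(Sums[key].y, (uint)(c.y * kFixedScale));
        InterlockedAdd(Sums[key].z, (uint)(c.z * kFixedScale));
        InterlockedAdd(Sums[key].w, 1);
    }

    [numthreads(8, 8, 1)]
    void Resolve(uint2 pixelPos : SV_DispatchThreadID)
    {
        uint4 s = Sums[TexDenseKey[pixelPos]];
        TexColor[pixelPos] = float3(s.xyz) / (kFixedScale * max(s.w, 1u));
    }

    Dispatch Accumulate over the full texture, then Resolve over the same grid (1920x1080 divides evenly into 8x8 groups, so no bounds check is shown). The dense remap is exactly what the sort and prefix-sum passes produce, and it is also what lets you size the Sums buffer to the number of unique keys.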