Atomic swap on more than one number in an HLSL Compute Shader?

I am trying to implement an in-place falling sand algorithm. Let's say I have a 3D texture and I want every cell that is not zero to move one cell down if the cell below is empty (0). That is easy to do:

Let's say 'from' is xyz and 'to' is xyz + (0, -1, 0):

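uint toValueWas;
// Write the source value into 'to' only if 'to' is currently empty (0).
InterlockedCompareExchange(_texture[to], 0, _texture[from], toValueWas);
if (toValueWas == 0)
{
    // toValueWas is 0 here, so this clears the source cell.
    _texture[from] = toValueWas;
}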

This works perfectly, but what if I need more storage than one uint? Let's say I want a second texture, _textureB, that gives me extra storage and whose moves I want to keep in sync with the original texture. I have tried every which way to do this, but the two textures/buffers always go out of sync. For example, this doesn't work:

uint toValueWas;
InterlockedCompareExchange(_texture[to], 0, _texture[from], toValueWas);
if (toValueWas == 0)
{
    _texture[from] = toValueWas;
    // Goes out of sync with _texture:
    InterlockedExchange(_textureB[to], _textureB[from], _textureB[from]);
}

The question boils down to: Is there a way to effectively do atomic swaps on more than 32 bits?

I expected to find some way to atomically swap more than 32 bits of data, but was unable to find one.

Solution

https://forum.unity.com/threads/atomic-swap-on-more-than-one-number.1410948/

Okay, so after about a week of bashing my head against this, I think I finally fixed it by having a bit in texture A that says "has this one been moved yet?", which is reset in a different kernel at the start of every update. This basically restricts the atomic moves to one move per cell per kernel execution.

That removes the possibility of the following construct occurring, which complicates things massively (imagine we're swapping everything one index to the right if it's empty): thread 0 happens to get executed first and swaps [0] and [1]; then thread 1 happens to get executed next, picks up right after, and sees that it can already do yet another swap from [1] to [2]; and so on. The final working code is:

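// Non-atomic read of the source cell.
uint _textureFrom = _texture[from];

// Skip cells that have already moved during this update.
if (bit(_textureFrom, BEEN_MOVED_BIT) == 1) return;

// Mark the value as moved before it is written to the destination.
_textureFrom = setbit(_textureFrom, BEEN_MOVED_BIT, 1);

uint toValueWas;
// Write the marked value into 'to' only if 'to' is still empty (0).
InterlockedCompareExchange(_texture[to], 0, _textureFrom, toValueWas);
if (toValueWas == 0)
{
    // The move succeeded, so clear the source cell.
    _texture[from] = 0;

    // Copy the extra data to the destination; the old value of _textureB[to] is discarded.
    uint __;
    InterlockedExchange(_textureB[to], _textureB[from], __);
}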
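The code above relies on bit, setbit, BEEN_MOVED_BIT and a separate reset kernel, none of which are shown. Here is a minimal sketch of what they could look like. The bit index (31), the kernel name ResetMovedBits, the thread group size and the resource declaration are all assumptions made for illustration, not the definitions from my project:

```hlsl
// Assumed bit index: the top bit of each cell is reserved as the "moved" flag.
#define BEEN_MOVED_BIT 31

// Assumed declaration: a 3D texture with a single 32-bit uint per cell.
RWTexture3D<uint> _texture;

// Returns 1 if bit 'index' of 'value' is set, otherwise 0.
uint bit(uint value, uint index)
{
    return (value >> index) & 1u;
}

// Returns 'value' with bit 'index' forced to 'state' (0 or 1).
uint setbit(uint value, uint index, uint state)
{
    return (value & ~(1u << index)) | ((state & 1u) << index);
}

// Hypothetical reset kernel: dispatch it once per update, before the move kernel.
[numthreads(8, 8, 8)]
void ResetMovedBits(uint3 id : SV_DispatchThreadID)
{
    // Each thread touches only its own cell, so no atomics are needed here.
    _texture[id] = setbit(_texture[id], BEEN_MOVED_BIT, 0);
}
```

With this layout the remaining 31 bits hold the actual cell value, and an empty cell is still exactly 0 because the flag is only ever set on a value that is being moved.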

My guess at the source of the desync is that while the Interlocked* calls themselves are indeed perfectly atomic, everything else is not atomic... I don't know; the more I think about it, the more it still doesn't really make sense, but this works. At the very least, I can say that one of the keys to the desync was definitely the situation where more than one swap happened in quick succession.
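For anyone who needs much more than 32 bits per cell, a different approach that might be worth trying is to avoid moving the extra data at all: keep the payload in a buffer indexed by a particle ID, and let the atomically swapped uint hold only that ID. The sketch below is untested and is not what I ended up using. The Particle struct, its fields, the resource names and the kernel name Fall are all hypothetical:

```hlsl
// Hypothetical per-particle payload; the fields are placeholders.
struct Particle
{
    float temperature;
    uint  material;
};

// Each cell stores (particle index + 1), so 0 still means "empty".
RWTexture3D<uint> _cells;

// The payload is addressed by particle index and never moves.
RWStructuredBuffer<Particle> _particles;

[numthreads(8, 8, 8)]
void Fall(uint3 from : SV_DispatchThreadID)
{
    // The bottom layer has nowhere to fall to.
    if (from.y == 0) return;

    // Nothing to move if this cell is empty.
    uint id = _cells[from];
    if (id == 0) return;

    uint3 to = from - uint3(0, 1, 0);

    // Claim the cell below only if it is still empty.
    uint toValueWas;
    InterlockedCompareExchange(_cells[to], 0, id, toValueWas);
    if (toValueWas == 0)
    {
        // The ID now lives in 'to'; the data at _particles[id - 1] is untouched.
        _cells[from] = 0;
    }
}
```

Because only a single uint ever changes place, there is no second resource to keep in sync. A particle can still fall more than one cell per dispatch in this version, so the "been moved" bit would have to be added on top if that matters.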