I'm curious about SIMD and wondering if it can handle this use case.
Let's say I have an array of 2048 integers, like [0x018A, 0x004B, 0x01C0, 0x0234, 0x0098, 0x0343, 0x0222, 0x0301, 0x0398, 0x0087, 0x0167, 0x0389, 0x03F2, 0x0034, 0x0345, ...]
Note how they all start with either 0x00, 0x01, 0x02, or 0x03. I want to split them into 4 arrays, one for each leading byte.
I imagine I would have code like this:
#include <stdint.h>

int main() {
    uint16_t in[2048] = ...;
    // 4 arrays, one for each category
    uint16_t out[4][2048];
    // Pointers to the next available slot in each of the arrays
    uint16_t *nextOut[4] = { out[0], out[1], out[2], out[3] };
    for (uint16_t *nextIn = in; nextIn < in + 2048; nextIn += 4) {
        (*** magic simd instructions here ***)
        // Equivalent non-simd code:
        uint16_t categories[4];
        for (int i = 0; i < 4; i++) {
            categories[i] = nextIn[i] >> 8;  // 0, 1, 2, or 3
        }
        for (int i = 0; i < 4; i++) {
            uint16_t category = categories[i];
            *nextOut[category] = nextIn[i];
            nextOut[category]++;
        }
    }
    // Now I have my categorized arrays!
}
I imagine that my first inner loop doesn't need SIMD; extracting the category bytes can be a single (x & 0xFF00FF00FF00FF00) mask operation. But I wonder if that second inner loop can be turned into SIMD instructions.
Is there any sort of SIMD instruction for this "categorizing" action that I'm doing?
The "insert" instructions seem somewhat promising, but I'm a bit too green to understand the descriptions at https://software.intel.com/en-us/node/695331.
If not, does anything come close?
Thanks!
You can do it with SIMD, but how fast it is will depend on exactly what instruction sets you have available, and how clever you are in your implementation.
One approach is to take the array and "sift" it to separate out elements that belong in different buckets. For example, grab 32 bytes from your array, which gives you 16 16-bit elements. Use a cmpgt instruction to get a mask which determines whether each element falls into the 00 + 01 bucket or the 02 + 03 bucket. Then use some kind of "compress" or "filter" operation to move all masked elements contiguously into one end of a register, and then the same for the unmasked elements.
Then repeat this one more time to sort out 00 from 01 and 02 from 03.
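In scalar terms, one sift pass looks like the sketch below (sift_pass is a hypothetical helper name, not a real intrinsic): elements whose category byte is below the split point go to one buffer, the rest to the other, which is exactly what a SIMD compare-plus-compress pair would do to a register. You run it once with a split of 02, then again on each half with splits of 01 and 03.

```c
#include <stdint.h>

/* Scalar model of one "sift" pass: elements with category (high byte)
 * below `split` go to lo, the rest to hi. A SIMD version would do the
 * compare to build a mask and a compress/left-pack to move the masked
 * elements together. Returns the count placed in lo. */
static int sift_pass(const uint16_t *src, int n, int split,
                     uint16_t *lo, uint16_t *hi)
{
    int nlo = 0, nhi = 0;
    for (int i = 0; i < n; i++) {
        if ((src[i] >> 8) < split)
            lo[nlo++] = src[i];
        else
            hi[nhi++] = src[i];
    }
    return nlo;
}
```

The counts returned at each level tell you where each bucket starts, so stability within a bucket is preserved just as the left-pack keeps lane order.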
With AVX2 you could start with this question for inspiration on the "compress" operation. With AVX512 you could use the vcompress instruction to help out: it does exactly this operation, but only at 32-bit or 64-bit granularity, so you'd need at least a couple per vector.
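To pin down the semantics, here is a scalar model of what a masked dword compress (e.g. _mm512_mask_compressstoreu_epi32) does across 16 lanes; compress_model is just an illustrative name:

```c
#include <stdint.h>

/* Scalar model of AVX-512 vcompressd over 16 dword lanes: for each set
 * bit i of mask (scanning low to high), copy src[i] to the next free
 * slot of dst, so the selected lanes end up contiguous at the front.
 * Returns how many elements were written. */
static int compress_model(const uint32_t *src, uint16_t mask, uint32_t *dst)
{
    int n = 0;
    for (int i = 0; i < 16; i++)
        if (mask & (1u << i))
            dst[n++] = src[i];
    return n;
}
```

With 16-bit elements you would need to widen to dwords (or use a byte shuffle instead), which is why a couple of these per vector are needed before AVX512BW-era word-granularity options.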
You could also try a vertical approach, where you load N vectors and then swap elements between them so that the 0th vector has the smallest elements, and so on. At this point you can use a more optimized algorithm for the compress stage (e.g., if you vertically sort enough vectors, the vectors at the ends may consist entirely of elements starting with 0x00, etc.).
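The building block of that vertical sort is a lane-wise compare-exchange between two loaded vectors, which SIMD does with a min/max pair (pminuw/pmaxuw for unsigned words). A scalar model of one such step, with an illustrative name:

```c
#include <stdint.h>

/* Scalar model of one vertical compare-exchange between two "vectors"
 * of n lanes, as a pminuw/pmaxuw pair would do: afterwards a[i] holds
 * the smaller of each pair and b[i] the larger. Applying this across N
 * vectors in a sorting-network pattern pushes the small categories
 * toward the first vectors. */
static void vertical_swap(uint16_t *a, uint16_t *b, int n)
{
    for (int i = 0; i < n; i++) {
        if (a[i] > b[i]) {
            uint16_t t = a[i];
            a[i] = b[i];
            b[i] = t;
        }
    }
}
```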
Finally, you might also consider organizing your data differently, either at the source or as a pre-processing step: separating out the "category" byte which is always 0-3 from the payload byte. Many of the processing steps only need to happen on one or the other, so you can potentially increase efficiency by splitting them out that way. For example, you could do the comparison operation on 32 bytes that are all categories, and then do the compress operation on the 32 payload bytes (at least in the final step where each category is unique).
This would lead to arrays of byte elements, not 16-bit elements, where the "category" byte is implicit. You've cut your data size in half, which might speed up everything else you want to do with the data in the future.
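As a pre-processing sketch (split_tag_payload is a made-up name), the separation is just a byte deinterleave; the SIMD analogue would be a byte shuffle or pack so that compares only ever touch category bytes and the compress only moves payload bytes:

```c
#include <stdint.h>

/* Split each 16-bit value into its category (high byte, here 0-3) and
 * its payload (low byte), producing two parallel byte arrays. */
static void split_tag_payload(const uint16_t *in, int n,
                              uint8_t *cats, uint8_t *payloads)
{
    for (int i = 0; i < n; i++) {
        cats[i]     = (uint8_t)(in[i] >> 8);
        payloads[i] = (uint8_t)(in[i] & 0xFF);
    }
}
```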
If you can't produce the source data in this format, you could use the bucketing as an opportunity to remove the tag byte as you put the payload into the right bucket, so the output is uint8_t out[4][2048];. If you're doing a SIMD left-pack with a pshufb byte-shuffle as discussed in the comments, you could choose a shuffle control vector that packs only the payload bytes into the low half.
(Until AVX512BW, x86 SIMD doesn't have any variable-control word shuffles, only byte or dword, so you already needed a byte shuffle which can just as easily separate payloads from tags at the same time as packing payload bytes to the bottom.)
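A scalar sketch of that left-pack idea (all names here are illustrative): a lookup table indexed by the compare mask lists the kept lanes front-packed, which is exactly what a table of pshufb control vectors does; picking only the payload byte of each 16-bit lane drops the tag in the same shuffle.

```c
#include <stdint.h>

/* pack_idx[mask] lists, packed at the front, the lane indices whose
 * mask bit is set (4 lanes shown for brevity). A pshufb version would
 * expand each entry into a byte-shuffle control that selects only the
 * payload byte of each chosen 16-bit element. */
static const uint8_t pack_idx[16][4] = {
    {0,0,0,0}, {0,0,0,0}, {1,0,0,0}, {0,1,0,0},
    {2,0,0,0}, {0,2,0,0}, {1,2,0,0}, {0,1,2,0},
    {3,0,0,0}, {0,3,0,0}, {1,3,0,0}, {0,1,3,0},
    {2,3,0,0}, {0,2,3,0}, {1,2,3,0}, {0,1,2,3},
};

/* Left-pack the payload bytes of the lanes selected by the 4-bit mask.
 * Returns how many bytes were written. */
static int left_pack_payload4(const uint16_t *src, unsigned mask,
                              uint8_t *out)
{
    int n = 0;
    for (unsigned m = mask; m; m >>= 1)  /* popcount of the 4-bit mask */
        n += m & 1;
    for (int i = 0; i < n; i++)
        out[i] = (uint8_t)(src[pack_idx[mask][i]] & 0xFF);  /* payload only */
    return n;
}
```

In a real implementation the popcount comes from the same movemask result that indexes the table, and the table holds full 16-byte shuffle controls rather than lane indices.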