vector instructions ("vcl" and "ume") for counting sort

I'm trying vector instruction using libraries "vcl" and "ume" for a kind of counting sort, which gives only the position back

// icpc sort.cpp -xCORE_AVX2 -o c
#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>
#include "vcl/vectorclass.h"
#include "umesimd/UMESimd.h"
using namespace UME::SIMD;

int main(void) {
    //scalar version as example __________________________________________
    int s0[4], k0 = 0, x0[4] = { 9,
                                 1,
                                 2,
                                 1};
    for (int i = 9; i >= 0; i--)
        for (int j = 0; j < 4; j++)
            if (x0[j] == i) {
                s0[k0] = j;
                k0++; }
    for (int j = 0; j < 4; j++) printf("%i\n", s0[j]); printf("\n");

    int a[32] = {9, 1,  5,  2,  1,  6,  3,  5,
                1,  4,  0,  3,  4,  7,  1,  1,
                2,  2,  1,  4,  4,  7,  2,  5,
                1,  4,  0,  1,  6,  4,  6,  5};
    //vcl version _____________________________________________________________________
    Vec8i s1[4] = 0, k1 = 0, x1[4]; Vec8ib msk1;
    x1[0].load(a);  x1[1].load(a + 8);  x1[2].load(a + 16); x1[3].load(a + 24);
    for (int i = 0; i < 4; i++) { for (int j = 0; j < 8; j++) printf("%i\t", x1[i].extract(j)); printf("\n"); } printf("\n");
    for (int i = 9; i >= 0; i--)
        for (int j = 0; j < 4; j++) {
            msk1 = (x1[j] == i);
            //s1[k1] = select(msk1, j, s1);
            k1 = select(msk1, k1 + 1, k1); }
    for (int i = 0; i < 4; i++) { for (int j = 0; j < 8; j++) printf("%i\t", s1[i].extract(j)); printf("\n"); } printf("\n");

    //ume version ______________________________________________________________________
    SIMD8_32i s2[4] = 0, k2 = 0, x2[4]; SIMDMask8 msk2;
    x2[0].load(a); x2[1].load(a + 8); x2[2].load(a + 16); x2[3].load(a + 24);
    for (int i = 0; i < 4; i++) { for (int j = 0; j < 8; j++) printf("%i\t", x2[i].extract(j)); printf("\n"); } printf("\n");
    for (int i = 9; i >= 0; i--)
        for (int j = 0; j < 4; j++) {
            msk2 = x2[j].cmpeq(i);
            //s2[k2].assign(msk2, j);
            k2.assign(msk2, k2 + 1); }
    for (int i = 0; i < 4; i++) { for (int j = 0; j < 8; j++) printf("%i\t", s2[i].extract(j)); printf("\n"); } printf("\n");

    return 0;
}

unfortunately I cannot find the solution for any of the libraries and I need help to solve it (commented lines)

Solution

The hard part of Counting Sort is the histogram problem.

Histograms are hard for SIMD because you require gather loads and scatter stores, and detection of conflicts (where your vector of input elements has duplicates, so you need to increment the same bucket multiple times). x86 only has this with AVX512.

Without gather/scatter/conflict-detection, there are scalar tricks that are useful for small numbers of buckets. e.g. replicating the count array 4 times to mitigate store/reload store-forwarding bottlenecks if the same input number appears makes up most of the input for a while. (With out of order execution, it doesn't have to be 100% contiguous to cause a problem.)

You can use SIMD to sum up the bucket arrays at the end, e.g. _mm256_add_epi32 for 32-bit counters. There's some scope for manual SIMD optimization of the memset / std::fill part of producing the sorted output, but usually this isn't the slow part.