c++, arm, intrinsics, neon

Efficiently count the number of distinct values in a 16-byte buffer with ARM NEON


Here's the basic algorithm to count the number of distinct values in a buffer:

unsigned getCount(const uint8_t data[16])
{
    uint8_t pop[256] = { 0 };
    unsigned count = 0;
    for (int i = 0; i < 16; ++i)
    {
        uint8_t b = data[i];
        if (0 == pop[b])
            count++;
        pop[b]++;
    }
    return count;
}

Can this somehow be done efficiently in NEON, by loading the data into a q-register and doing some bit magic? Alternatively, can I efficiently tell that data has all elements identical, or contains only two distinct values, or more than two?

For example, using vminvq_u8 and vmaxvq_u8 I can find the min and max elements, and if they are equal I know that data has all identical elements. If not, I can vceqq_u8 against the min value and vceqq_u8 against the max value, then vorrq_u8 these results and check that the result is all 1s. So it can be done in NEON this way. Any ideas how to make it better?

unsigned getCountNeon(const uint8_t data[16])
{
    uint8x16_t s = vld1q_u8(data);
    uint8x16_t smin = vdupq_n_u8(vminvq_u8(s));
    uint8x16_t smax = vdupq_n_u8(vmaxvq_u8(s));
    uint8x16_t res = vdupq_n_u8(1);
    uint8x16_t one = vdupq_n_u8(1);

    for (int i = 0; i < 14; ++i) // this obviously needs to be unrolled
    {
        s = vbslq_u8(vceqq_u8(s, smax), smin, s); // replace max with min
        uint8x16_t smax1 = vdupq_n_u8(vmaxvq_u8(s));
        // vceqq gives 0xFF (i.e. -1) on equal, so this adds 1 only when the
        // max dropped, i.e. one more distinct value was eliminated
        res = vaddq_u8(res, vaddq_u8(vceqq_u8(smax1, smax), one));
        smax = smax1;
    }
    res = vaddq_u8(res, vaddq_u8(vceqq_u8(smax, smin), one)); // +1 if a value above min remains
    return vgetq_lane_u8(res, 0);
}

With some optimizations and improvements, a 16-byte block can perhaps be processed in 32-48 NEON instructions. Can this be done better on ARM? It seems unlikely.
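For the cheaper classification mentioned above (all identical / exactly two distinct / three or more), a scalar model of the vminvq + vmaxvq + vceqq + vorrq check might look like this (the function name and return convention are mine, just for illustration):

```cpp
#include <cstdint>
#include <algorithm>

// Scalar equivalent of the NEON check described above: find min and max
// (vminvq_u8 / vmaxvq_u8); if equal, all bytes are identical. Otherwise,
// compare every byte against min and against max (vceqq_u8), OR the
// results (vorrq_u8): all lanes set iff every byte is either min or max.
// Returns 1, 2, or 3 meaning "three or more distinct values".
unsigned classifyUpToTwo(const uint8_t data[16])
{
    uint8_t mn = data[0], mx = data[0];
    for (int i = 1; i < 16; ++i) {          // horizontal min / max
        mn = std::min(mn, data[i]);
        mx = std::max(mx, data[i]);
    }
    if (mn == mx)
        return 1;                           // all elements identical
    for (int i = 0; i < 16; ++i)            // the vceqq + vorrq + all-ones test
        if (data[i] != mn && data[i] != mx)
            return 3;                       // some third value exists
    return 2;                               // exactly two distinct values
}
```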

Some background on why I'm asking. While working on an algorithm, I'm trying different approaches to processing the data and I'm not sure yet which I'll use in the end. Information that might be of use:

  • count of distinct elements per 16-byte block
  • value that repeats most per 16-byte block
  • average per block
  • median per block
  • speed of light?.. that's a joke, it cannot be computed in neon from 16-byte block :)

So I'm trying things out, and before I settle on any approach I want to see whether it can be optimized well. For example, the average per block will basically run at memcpy speed on arm64.
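For instance, the per-block average reduces to one horizontal add: vaddlvq_u8 sums the 16 bytes into a 16-bit result, then a shift by 4. A scalar model of it (the rounding choice, plain truncation, is mine):

```cpp
#include <cstdint>

// Scalar model of the per-block average: vaddlvq_u8 would give the 16-bit
// horizontal sum of the 16 bytes (a plain u8 sum could overflow), then
// divide by the block size. Truncating division, no rounding.
uint8_t blockAverage(const uint8_t data[16])
{
    uint16_t sum = 0;
    for (int i = 0; i < 16; ++i)
        sum += data[i];
    return (uint8_t)(sum / 16);
}
```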


Solution

  • If you're expecting a lot of duplicates, and can efficiently get a horizontal min with vminv_u8, this might be better than scalar. Or not, maybe NEON->ARM stalls for the loop condition kill it. >.< But it should be possible to mitigate that with unrolling (and saving some info in registers to figure out how far you overshot).

    // Real AArch64 intrinsics (originally pseudo-code); vmaxvq_u8 / vminvq_u8
    // are the horizontal max/min reductions (AArch64-only).
    // But I *think* ARM can do these things efficiently, 
    // except perhaps the loop condition.  High latency could be ok, but stalling isn't
    
    int count_dups(uint8x16_t v)
    {
        int dups = (0xFF == vmaxvq_u8(v)); // count=1 if any elements are 0xFF to start
        uint8_t hmin = vminvq_u8(v);
    
        while (hmin != 0xFF) {
            uint8x16_t min_bcast = vdupq_n_u8(hmin);     // broadcast the minimum
            uint8x16_t matches = vceqq_u8(v, min_bcast); // 0xFF where v == min
            v = vorrq_u8(v, matches);                    // min and its dups become 0xFF
            hmin = vminvq_u8(v);
            dups++;
        }
        return dups;
    }
    

    This turns unique values into 0xFF, one set of duplicates at a time.

    The loop-carried dep chain through v / hmin stays in vector registers; it's only the loop branch that needs NEON->integer.
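    A portable scalar model of the same loop (my sketch, handy as a reference for testing): each pass turns the current minimum and its duplicates into 0xFF and counts one distinct value, with 0xFF itself counted up front since the loop can't see it.

```cpp
#include <cstdint>
#include <algorithm>

// Scalar model of count_dups above. Modifies v in place, like the
// vector version clobbers its register copy.
int countDistinctModel(uint8_t v[16])
{
    uint8_t mx = *std::max_element(v, v + 16);
    int count = (mx == 0xFF);             // the up-front vmax check
    uint8_t mn = *std::min_element(v, v + 16);
    while (mn != 0xFF) {
        for (int i = 0; i < 16; ++i)      // cmpeq + OR: min and its dups -> 0xFF
            if (v[i] == mn)
                v[i] = 0xFF;
        mn = *std::min_element(v, v + 16);
        count++;
    }
    return count;
}
```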


    Minimizing / hiding NEON->integer/ARM penalties

    Unroll by 8 with no branches on hmin, leaving results in 8 NEON registers. Then transfer those 8 values; back-to-back transfers of multiple NEON registers to ARM only incur one total stall (of 14 cycles, on whatever hardware Jake tested). Out-of-order execution could also hide some of the penalty for this stall. Then check those 8 integer registers with a fully-unrolled integer loop.

    Tune the unroll factor to be large enough that you usually don't need another round of SIMD operations for most input vectors. If almost all of your vectors have at most 5 unique values, then unroll by 5 instead of 8.


    Instead of transferring multiple hmin results to integer, count them in NEON. If you can use ARM32 NEON partial-register tricks to put multiple hmin values in the same vector for free, it's only a bit more work to shuffle 8 of them into one vector and compare for not-equal to 0xFF. Then horizontally add that compare result to get a -count.

    Or if you have values from different input vectors in different elements of a single vector, you can use vertical operations to add results for multiple input vectors at once without needing horizontal ops.


    There's almost certainly room to optimize this, but I don't know ARM that well, or ARM performance details. NEON's hard to use for anything conditional because of the big performance penalty for NEON->integer, totally unlike x86. Glibc has a NEON memchr with NEON->integer in the loop, but I don't know whether that version is actually used, or whether it's faster than scalar.


    Speeding up repeated calls to the scalar ARM version:

    Zeroing the 256-byte buffer every time would be expensive, but we don't need to do that. Use a sequence number to avoid needing to reset:

    • Before every new set of elements: ++seq;
  • For each element b in the set:

      dups += (histogram[b] == seq);  // b already seen this set => one more dup
      histogram[b] = seq;             // no data dependency on the load result, unlike ++
      

    You might make the histogram an array of uint16_t or uint32_t to avoid needing to re-zero if a uint8_t seq wraps. But then it takes more cache footprint, so maybe just re-zeroing every 254 sequence numbers makes the most sense.
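    Putting those pieces together, a sketch of the scalar version with the sequence-number trick (the struct wrapper, the uint16_t histogram, and all names are mine):

```cpp
#include <cstdint>
#include <cstring>

// Sequence-number histogram: no per-call zeroing of the 256-entry table.
// Counts duplicates per 16-byte block; distinct values = 16 - dups.
struct DupCounter
{
    uint16_t histogram[256] = { 0 };   // uint16_t: wraps far less often than uint8_t
    uint16_t seq = 0;

    unsigned countDups(const uint8_t data[16])
    {
        if (++seq == 0) {               // seq wrapped: re-zero once per 65535 calls
            memset(histogram, 0, sizeof(histogram));
            seq = 1;
        }
        unsigned dups = 0;
        for (int i = 0; i < 16; ++i) {
            uint8_t b = data[i];
            dups += (histogram[b] == seq);  // already seen in this block?
            histogram[b] = seq;             // plain store, no dependency on the load
        }
        return dups;
    }
};
```

Successive calls reuse the table: only the sequence number changes, so the 256 entries never need resetting between blocks.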