Aggregate sum for set bits in NEON SIMD

I have an algorithm that operates on a large array of bytes. As a preprocessing step, I need to create for a given index a count of which bit is how often set in the array up to this point.

I can do this in C using the following (pseudo) code:

input: uint8_t values[COUNT];
output: uint32_t bitsum[COUNT+1][8];
       (bitsum[i][x] is the counter for the x-th bit being set in
        the PRECEEDING i bytes -- this makes bitsum[0][x] == 0)

// we skip first row
for (int i=1; i < COUNT+1; i++) {
   for (int bit=0; bit < 8; bit++) {
      bitsum[i][bit] = bitsum[i-1][bit];
      if (values[i-1] & (1 << bit) != 0) {
         bitsum[i][bit]++;
      }
   }
}

However I am hopign that I can achieve this faster using NEON SIMD. Unfortunately I am quite new to this so I have not been able to solve this (yet?) and looking for some help. Is it even possible to do this in NEON?

UPDATE:

Trying to speed this code up in C, I believe the following approach is the fastest (sans unrolling the inner for loop, of course):

// pre-calculate lookup-table
uint16_t lookup[256][8];
for (int value=0; value < 256; value++) {
   for (int bit=0; bit < 8; bit++) {
      if (value & (1 << bit) != 0) {
         lookup[value][bit]++;
      }
   }
}

// create sum
for (int i=1; i < COUNT+1; i++) {
   for (int bit=0; bit < 8; bit++) {
      bitsum[i][bit] = bitsum[i-1][bit] + lookup[values[i-1]][bit];
   }
}

This looks like it would be ideal for SIMD, except for the lookup-table access - at least I can not find a way to do that in NEON.

Solution

You can do table lookups in NEON using the VTBL and VTBX instructions, but they are only useful for look up tables with few entries. When optimising for NEON it is often best to look for a way to calculate values at run time instead of using a table.

In this example it is straightforward to calculate the lookup at run time. The function is essentially

int lookup(int val, int bit) { return (val & (1<<bit) >> bit); }

which can be converted to NEON SIMD easily.

So, your function can be implemented with NEON intrinsics like this:

#include <arm_neon.h>

void f(uint32_t *output, const uint8_t *input, int length)
{   

    static const uint8_t mask_vals[] = {  0x1,  0x2,  0x4,  0x8,
                                         0x10, 0x20, 0x40, 0x80 };
    /* NEON shifts are left shifts, and we want a right shift,
       so use negative numbers here */
    static const int8_t shift_vals[] = { 0, -1, -2, -3, -4, -5, -6, -7 };

    /* constants we need in the main loop */
    uint8x8_t mask    = vld1_u8(mask_vals);
    int8x8_t shift    = vld1_s8(shift_vals);

    /* accumulators for results, bits 0-3 in cumul1, bits 4-7 in cumul2 */
    uint32x4_t cumul1 = vdupq_n_u32(0);
    uint32x4_t cumul2 = vdupq_n_u32(0);

    for (int i = 0; i < length; i++)
    {   
        uint8x8_t v = vld1_dup_u8(input+i);
        /* this gives 0 or 1 in each lane, depending on whether the
           appropriate bit is set */
        uint8x8_t incr = vshl_u8(vand_u8(v, mask), shift);

        /* widen to 16 bits */
        uint16x8_t incr_w = vmovl_u8(incr);

        /* increment the accumulators */
        cumul1 = vaddw_u16(cumul1, vget_low_u16(incr_w));
        cumul2 = vaddw_u16(cumul2, vget_high_u16(incr_w));
        /* store the accumulator values */
        vst1q_u32(output + i*8, cumul1);
        vst1q_u32(output + i*8 + 4, cumul2);
    }
}

Disclaimer: This code compiles, but I haven't tested it.