I'm working on some image processing app for an iOS and thresholding is really a huge bottleneck. So I'm trying to optimize it using NEON. Here is the C version of function. Is there any way to rewrite this using NEON (unfortunately I have absolutely no experience in this) ?
static void thresh_8u( const Image& _src, Image& _dst, uchar thresh, uchar maxval, int type ) {
int i, j;
uchar tab[256];
Size roi = _src.size();
roi.width *= _src.channels();
memset(&tab[0], 0, thresh);
memset(&tab[thresh], maxval, 256-thresh);
for( i = 0; i < roi.height; i++ ) {
const uchar* src = (const uchar*)(_src.data + _src.step*i);
uchar* dst = (uchar*)(_dst.data + _dst.step*i);
j = 0;
for(; j <= roi.width; ++j) {
dst[j] = tab[src[j]];
}
}
}
It's actually pretty easy, if you can make sure your rows are always a multiple of 16 bytes wide, because the compiler (clang) has special types representing the NEON vector registers, and knows how to apply the normal C operators to them. Here's my little test function:
#ifdef __ARM_NEON__
#include <arm_neon.h>
void computeThreshold(void *input, void *output, int count, uint8_t threshold, uint8_t highValue) {
uint8x16_t thresholdVector = vdupq_n_u8(threshold);
uint8x16_t highValueVector = vdupq_n_u8(highValue);
uint8x16_t *__restrict inputVector = (uint8x16_t *)input;
uint8x16_t *__restrict outputVector = (uint8x16_t *)output;
for ( ; count > 0; count -= 16, ++inputVector, ++outputVector) {
*outputVector = (*inputVector > thresholdVector) & highValueVector;
}
}
#endif
This operates on 16 bytes at a time. A uint8x16_t
is a vector register containing 16 8-bit unsigned ints. The vdupq_n_u8
returns a vector uint8x16_t
filled with 16 copies of its argument.
The >
operator, applied to two uint8x16_t
values, does 16 comparisons between pairs of 8-bit unsigned ints. Where the left input is greater than the right input, it returns 0xff (which is different from a normal C >
, which just returns 0x01). Where the left input is less than or equal to the right input, it returns 0. (It compiles into the VCGT.U8 instruction.)
The &
operator, applied to two uint8x16_t
values, computes the boolean AND of 128 pairs of bits.
The loop compiles to this in a release build:
0x6e668: vldmia r2, {d4, d5}
0x6e66c: subs r0, #16
0x6e66e: vcgt.u8 q10, q10, q8
0x6e672: adds r2, #16
0x6e674: cmp r0, #0
0x6e676: vand q10, q10, q9
0x6e67a: vstmia r1, {d4, d5}
0x6e67e: add.w r1, r1, #16
0x6e682: bgt 0x6e668