Within a loop i have to implement a sort of clipping
if ( isLast )
{
val = ( val < 0 ) ? 0 : val;
val = ( val > 255 ) ? 255 : val;
}
However this "clipping" takes up almost half the time of execution of the loop in Neon . This is what the whole loop looks like-
for (row = 0; row < height; row++)
{
for (col = 0; col < width; col++)
{
Int sum;
//...Calculate the sum
Short val = ( sum + offset ) >> shift;
if ( isLast )
{
val = ( val < 0 ) ? 0 : val;
val = ( val > 255 ) ? 255 : val;
}
dst[col] = val;
}
}
This is how the clipping has been implemented in Neon
cmp %10,#1 //if(isLast)
bne 3f
vmov.i32 %4, d4[0] //put val in %4
cmp %4,#0 //if( val < 0 )
blt 4f
b 5f
4:
mov %4,#0
vmov.i32 d4[0],%4
5:
cmp %4,%11 //if( val > maxVal )
bgt 6f
b 3f
6:
mov %4,%11
vmov.i32 d4[0],%4
3:
This is the mapping of variables to registers-
isLast- %10
maxVal- %11
Any suggestions to make it faster ? Thanks
The clipping now looks like-
"cmp %10,#1 \n\t"//if(isLast)
"bne 3f \n\t"
"vmin.s32 d4,d4,d13 \n\t"
"vmax.s32 d4,d4,d12 \n\t"
"3: \n\t"
//d13 contains maxVal(255)
//d12 contains 0
Time consumed by this portion of the code has dropped from 223ms to 18ms
Using normal compares with NEON is almost always a bad idea because it forces the contents of a NEON register into a general purpose ARM register, and this costs lots of cycles.
You can use the vmin and vmax NEON instructions. Here is a little example that clamps an array of integers to any min/max values.
void clampArray (int minimum,
int maximum,
int * input,
int * output,
int numElements)
{
// get two NEON values with your minimum and maximum in each lane:
int32x2_t lower = vdup_n_s32 (minimum);
int32x2_t higher = vdup_n_s32 (maximum);
int i;
for (i=0; i<numElements; i+=2)
{
// load two integers
int32x2_t x = vld1_s32 (&input[i]);
// clamp against maximum:
x = vmin_s32 (x, higher);
// clamp against minimum
x = vmax_s32 (x, lower);
// store two integers
vst1_s32 (&output[i], x);
}
}
Warning: This code assumes the numElements is always a multiple of two, and I haven't tested it.
You may even make it faster if you process four elements at a time using the vminq / vmaxq instructions and load/store four integers per iteration.