With reference to my earlier question for border check condition - Border check in image processing? I am writing code with neon for border check.I am having below issues when writing the code :
Input :
--------------------------------
|221 220 228 223 230 233 234 235 ..
|71 73 70 78 92 130 141 143 ..
|
Requirement:
-1 * v_m1_m1 + 0 * v_m1_0 + 1 * v_m1_p1
-1 * v_0_m1 + 0 * v_0_0 + 1 * v_0_p1 --> v_out
-1 * v_p1_m1 + 0 * v_p1_0 + 1 * v_p1_p1
Pseudo Code:
for i = 0 to nrows - 1
// init row pointers
p_row_m1 = src + src_width * MAX(i-1, 0); // pointing to minus1 row
p_row_0 = src + src_width * i; // pointing to current row
p_row_p1 = src + src_width * MIN(i+1, src_width-1); // pointing to plus1 row
v_m1_m1 = vdupq_n_u32(p_row_m1[0]); // fill left vector from src[i-1][0]
v_0_m1 = vdupq_n_u32(p_row_0[0]); // fill left vector from src[i][0]
v_p1_m1 = vdupq_n_u32(p_row_p1[0]); // fill left vector from src[i+1][0]
v_m1_0 = vld1q_u32(&p_row_m1[0]); // load center vector from src[i-1][0..7]
v_0_0 = vld1q_u32(&p_row_0[0]); // load center vector from src[i][0..7]
v_p1_0 = vld1q_u32(&p_row_p1[0]); // load center vector from src[i+1][0..7]
for j = 0 to (ncols - 4) step 4 // assuming 4 elements per SIMD vector
v_m1_p1 = vld1q_u32(&p_row_m1[j+4]); // load right vector from src[i-1][0..7]
v_0_p1 = vld1q_u32(&p_row_0[j+4]); // load right vector from src[i][0..7]
v_p1_p1 = vld1q_u32(&p_row_p1[j+4]); // load right vector from src[i+1][0..7]
//
// you now have a 3x3 arrangement of vectors on which
// you can perform a neighbourhood operation and generate
// 16 output pixels for the current iteration:
//
// v_m1_m1 v_m1_0 v_m1_p1
// v_0_m1 v_0_0 v_0_p1
// v_p1_m1 v_p1_0 v_p1_p1
//
// |
// V
//
// v_out
vst1q_s32(v_out, &image_out[i][j]) // store output vector at image_out[i][j..j+15]
// shuffle vectors so that we can use them on next iteration
v_m1_m1 = v_m1_0
v_m1_0 = v_m1_p1
v_0_m1 = v_0_0
v_0_0 = v_0_p1
v_p1_m1 = v_p1_0
v_p1_0 = v_p1_p1
end_for
// for final iteration we need to handle right edge pixels...
v_m1_p1 = vdupq_n_u32(p_row_m1[ncols-1]) // fill right vector from image[i-1][ncols-1]
v_0_p1 = vdupq_n_u32(p_row_0[ncols-1]) // fill right vector from image[i][ncols-1]
v_p1_p1 = vdupq_n_u32(p_row_p1[ncols-1]) // fill right vector from image[i+1][ncols-1]
// calculate v_out as above
vst1q_s32(v_out, &image_out[i][j]) // store output vector at image_out[i][ncols_16..ncols-1]
end_for
In C code if I want to calculate for 221 (@i,j =0,0) ,With border it looks like:
221 221 220
-----------
221 |221 220
|
71 |71 73
In NEON code if I want to calculate for 221 (@i,j =0,0) ,With border it looks like:
221 221 221 221 221 220 228 223 230 233 234 235
-------------------------------
221 221 221 221 |221 220 228 223 230 233 234 235
|
71 71 71 71 |71 73 70 78 92 130 141 143
For calculating 221: In neon code neighbours are showed below and as per requirement it is going wrong:
v_m1_m1.0 v_m1_0.0 v_m1_p1.0
v_0_m1.0 v_0_0.0 v_0_p1.0
v_p1_m1.0 v_p1_0.0 v_p1_p1.0
221 221 230
---
221 |221| 230
---
71 71 92
How Can I take care of border with generic code logic ??
It looks like you need to understand how to implement a SIMD kernel with neighbourhood operations. This is the important part:
//
// you now have a 3x3 arrangement of vectors on which
// you can perform a neighbourhood operation and generate
// 4 output pixels for the current iteration:
//
// v_m1_m1 v_m1_0 v_m1_p1
// v_0_m1 v_0_0 v_0_p1
// v_p1_m1 v_p1_0 v_p1_p1
//
// |
// V
//
// v_out
Suppose you want to implement a sum over a cross-shaped kernel, i.e.:
y[i][j] = x[i][j] + x[i-1][j] + x[i+1][j] + x[i][j-1] + x[i][j+1];
The pseudo code for this in SIMD would be:
// sum vertically: x[i][j] + x[i-1][j] + x[i+1][j]
v_out = v_m1_0;
v_out = v_out + v_0_0; // vaddq
v_out = v_out + v_p1_0; // vaddq
// add the x[i][j-1] components
v_temp = v_0_m1:v_0_0 >> 1; // vextq - use this to get a right-shifted vector
v_out = v_out + v_temp; // vaddq
// add the x[i][j+1] components
v_temp = v_0_0:v_0_p1 << 1; // vextq - use this to get a left-shifted vector
v_out = v_out + v_temp; // vaddq
At this point v_out
now contains the four sums for output elements y[i][j]
..y[i][j+3]
. In other words we have evaluated four output points in one kernel. Now we shuffle all our vectors left by one, and load 3 new vectors for the right hand column, and do it all over again with j += 4
. If you look at the original pseudo-code from the previous question you will see that the border cases are all taken care of by filling a vector with the edge value.