I am trying to optimize some image processing algorithms for ARM using NEON intrinsics.
Some filters need to load the elements in the neighborhood of a point.
For example, to filter an image at pixel p[x] I need to load p[x - 1], p[x] and p[x + 1].
If x = 0, then I load p[0], p[0] and p[1]. If x = width - 1, then I load p[width - 2], p[width - 1] and p[width - 1].
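In scalar terms, the border handling described above could be sketched as follows. This is only a reference for the intended behavior, not NEON code; the names NeighborIndex and LoadNeighbors are illustrative and it assumes width >= 1.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (illustrative name): index of the neighbour at offset dx
// from x, clamped to the valid range [0, width - 1]. Assumes width >= 1.
inline std::size_t NeighborIndex(std::size_t x, std::ptrdiff_t dx, std::size_t width)
{
    std::ptrdiff_t i = static_cast<std::ptrdiff_t>(x) + dx;
    const std::ptrdiff_t last = static_cast<std::ptrdiff_t>(width) - 1;
    if (i < 0) i = 0;        // left border: reuse p[0]
    if (i > last) i = last;  // right border: reuse p[width - 1]
    return static_cast<std::size_t>(i);
}

// Loads p[x - 1], p[x], p[x + 1] into out[0..2], replicating the edge pixel
// when x is the first or last position of the row.
inline void LoadNeighbors(const std::uint8_t* p, std::size_t width, std::size_t x, std::uint8_t out[3])
{
    out[0] = p[NeighborIndex(x, -1, width)];
    out[1] = p[x];
    out[2] = p[NeighborIndex(x, +1, width)];
}
```

The vectorized version needs the same replication, but applied to a whole register at once.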
So if I have a vector:
uint8x16_t a = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15};
How can I get the following vectors from it:
uint8x16_t b = {0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14};
uint8x16_t c = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 15};
I think the following functions will be useful in your case:
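#include <arm_neon.h>
#include <cstddef>

// Shifts the lanes up by count (1..15); for count = 1, lane 0 is duplicated.
template <size_t count> inline uint8x16_t LoadBeforeFirst(uint8x16_t first)
{
    return vextq_u8(vextq_u8(first, first, count), first, 16 - count);
}

// Shifts the lanes down by count (1..15); for count = 1, lane 15 is duplicated.
template <size_t count> inline uint8x16_t LoadAfterLast(uint8x16_t last)
{
    return vextq_u8(last, vextq_u8(last, last, 16 - count), count);
}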
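To check what these two helpers produce without ARM hardware, here is a small portable sketch that models vextq_u8 in scalar code. It relies on vextq_u8(a, b, n) returning the 16 bytes that start at index n of the 32-byte concatenation of a and b. Vec16, ModelExt, ModelLoadBeforeFirst and ModelLoadAfterLast are illustrative names, not part of the NEON API.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Vec16 = std::array<std::uint8_t, 16>;  // stand-in for uint8x16_t

// Scalar model of vextq_u8(a, b, n): 16 bytes starting at index n of a:b.
inline Vec16 ModelExt(const Vec16& a, const Vec16& b, std::size_t n)
{
    Vec16 r{};
    for (std::size_t i = 0; i < 16; ++i)
    {
        const std::size_t j = i + n;
        r[i] = j < 16 ? a[j] : b[j - 16];
    }
    return r;
}

// Same expressions as LoadBeforeFirst / LoadAfterLast, using the model.
template <std::size_t count> inline Vec16 ModelLoadBeforeFirst(const Vec16& first)
{
    return ModelExt(ModelExt(first, first, count), first, 16 - count);
}

template <std::size_t count> inline Vec16 ModelLoadAfterLast(const Vec16& last)
{
    return ModelExt(last, ModelExt(last, last, 16 - count), count);
}
```

With a = {0, 1, ..., 15}, ModelLoadBeforeFirst<1>(a) gives the vector b from the question and ModelLoadAfterLast<1>(a) gives c; on the NEON side that corresponds to LoadBeforeFirst<1>(a) and LoadAfterLast<1>(a). Note that for count > 1 the first (or last) count lanes are repeated as a block rather than replicating a single edge lane, so count = 2 yields {0, 1, 0, 1, 2, ...} and not {0, 0, 0, 1, 2, ...}.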