How to efficiently load a vertical line of data from memory into NEON registers

I want to read a vertical line of data from an image block, i.e. I want to get the first element of every line (the line length equals the block width).

I don't think the following code is good. Is there a better implementation? (The data address is in r5 and the line length is in r1.)

vld1.u8     d3[0],  [r5],   r1
vld1.u8     d3[1],  [r5],   r1
vld1.u8     d3[2],  [r5],   r1
vld1.u8     d3[3],  [r5],   r1
vld1.u8     d3[4],  [r5],   r1
vld1.u8     d3[5],  [r5],   r1
vld1.u8     d3[6],  [r5],   r1
vld1.u8     d3[7],  [r5],   r1
vld1.u8     d4[0],  [r5],   r1
vld1.u8     d5[0],  [r5],   r1
vld1.u8     d5[1],  [r5],   r1    
vld1.u8     d5[2],  [r5],   r1
vld1.u8     d5[3],  [r5],   r1   
vld1.u8     d5[4],  [r5],   r1
vld1.u8     d5[5],  [r5],   r1    
vld1.u8     d5[6],  [r5],   r1
vld1.u8     d5[7],  [r5],   r1

Solution

NEON directly supports non-contiguous loads only for strides of up to 4 (via the VLDn instructions, where n is the stride). Since your line length is presumably much larger than that, I don't see a way to do what you want other than loading each element individually, as your code does.

However, if you need to apply this post-processing step not only to the first column but to all columns, you could process 8 columns at once (or 16, if you use Q registers) instead of handling each column individually. How feasible that is depends on your algorithm, of course.
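As a rough sketch of processing 8 columns at once, the following C code sums 8 adjacent columns over all rows. I don't know what your actual per-column operation is, so the summation is only a stand-in, and the function name `column_sums_x8` is made up. Each row costs one contiguous 8-byte load instead of eight single-lane loads.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical example operation: sums[x] = sum over all rows of column x,
 * for 8 adjacent columns. src points at the top of the first column and
 * stride is the line length in bytes (r1 in the question). */
void column_sums_x8(const uint8_t *src, size_t stride, size_t rows,
                    uint16_t sums[8])
{
    /* 257 * 255 = 65535, so 16-bit accumulators cannot overflow. */
    assert(rows <= 257);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint16x8_t acc = vdupq_n_u16(0);
    for (size_t y = 0; y < rows; ++y) {
        uint8x8_t row = vld1_u8(src);  /* one contiguous 8-byte load */
        acc = vaddw_u8(acc, row);      /* widen to 16 bits, lane x = column x */
        src += stride;
    }
    vst1q_u16(sums, acc);
#else
    /* Scalar fallback with the same result, for non-ARM builds. */
    for (int x = 0; x < 8; ++x)
        sums[x] = 0;
    for (size_t y = 0; y < rows; ++y) {
        for (int x = 0; x < 8; ++x)
            sums[x] = (uint16_t)(sums[x] + src[x]);
        src += stride;
    }
#endif
}
```

The point is that each lane of the register belongs to a different column, so no strided load is needed at all.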

Ideally, you'd increase the chunk size even further and process as many columns at once as fit into one cache line (64 on most ARMs, if your element size is 8 bits). Otherwise, if your image has many rows, the cache lines containing the first rows will have been evicted from the cache by the time you've processed the last ones, and they'll have to be re-fetched to process the next chunk of columns.
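The loop structure for that chunking might look roughly like the sketch below, again using column sums as a stand-in operation. `CHUNK` and `column_sums_chunked` are names invented for this example, and the 64-byte line size is an assumption you should check for your core. The chunks only coincide with cache lines if the image base address and the stride are both multiples of 64.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed cache line size in bytes; with 8-bit elements this is also the
 * number of columns per chunk. */
enum { CHUNK = 64 };

/* Hypothetical example: sums[x] = sum over all rows of column x.
 * The image is walked chunk by chunk, so every cache line that is fetched
 * gets fully consumed before the next row is touched. */
void column_sums_chunked(const uint8_t *img, size_t width, size_t height,
                         size_t stride, uint32_t *sums)
{
    for (size_t x0 = 0; x0 < width; x0 += CHUNK) {
        /* The last chunk may be narrower than CHUNK. */
        size_t n = (width - x0 < CHUNK) ? (width - x0) : CHUNK;

        for (size_t x = 0; x < n; ++x)
            sums[x0 + x] = 0;

        for (size_t y = 0; y < height; ++y) {
            const uint8_t *row = img + y * stride + x0;
            /* This inner loop is where 8- or 16-wide NEON code would go. */
            for (size_t x = 0; x < n; ++x)
                sums[x0 + x] += row[x];
        }
    }
}
```

Compared with walking one column (or one group of 8 columns) at a time, each row's cache line is fetched once per chunk rather than once per column group.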