Load 8-bit uint8_t as uint32_t?

My image processing project works with grayscale images. My platform is an ARM Cortex-A8 processor, and I want to make use of NEON.

I have a grayscale image (consider the example below), and in my algorithm I only have to add up the columns.

How can I load four 8-bit pixel values in parallel, which are uint8_t, as four uint32_t into one of the 128-bit NEON registers? What intrinsic do I have to use to do this?


I must load them as 32 bits because, if you look carefully, 255 + 255 is already 510, which can't be held in an 8-bit value.

e.g.

255 255 255 255 ......... (640 pixels)
255 255 255 255
255 255 255 255
255 255 255 255
.
.
.
.
.
(480 pixels)

Solution

I recommend that you spend a bit of time understanding how SIMD works on ARM. Take a look at:

http://blogs.arm.com/software-enablement/161-coding-for-neon-part-1-load-and-stores/
http://blogs.arm.com/software-enablement/196-coding-for-neon-part-2-dealing-with-leftovers/
http://blogs.arm.com/software-enablement/241-coding-for-neon-part-3-matrix-multiplication/
http://blogs.arm.com/software-enablement/277-coding-for-neon-part-4-shifting-left-and-right/

to get you started. You can then implement your SIMD code using inline assembly or the corresponding ARM intrinsics recommended by domen.
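
As a rough sketch of the intrinsics route (not a tuned implementation): NEON has no single load that turns four uint8_t into four uint32_t, so one common approach is to load eight pixels with vld1_u8, widen them to 16 bits with vmovl_u8, and let vaddw_u16 widen each half to 32 bits while adding it to a running sum. The helper name column_sums and its parameters are made up for this example, and the scalar fallback is only there so the same file also builds on a machine without NEON. With GCC on a Cortex-A8 you would typically need something like -mfpu=neon for the NEON path to be compiled in.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper (name and signature are assumptions for this sketch):
 * sums[x] = sum over all rows y of img[y * stride + x].
 * stride is the distance in bytes between the starts of two rows. */
static void column_sums(const uint8_t *img, size_t width, size_t height,
                        size_t stride, uint32_t *sums)
{
    assert(img != NULL && sums != NULL && stride >= width);
    memset(sums, 0, width * sizeof *sums);

    for (size_t y = 0; y < height; ++y) {
        const uint8_t *row = img + y * stride;
        size_t x = 0;
#if HAVE_NEON
        for (; x + 8 <= width; x += 8) {
            uint8x8_t  px  = vld1_u8(row + x);        /* 8 pixels, 8 bits each   */
            uint16x8_t w16 = vmovl_u8(px);            /* widen to 8 x 16 bits    */
            uint32x4_t lo  = vld1q_u32(sums + x);     /* running sums, cols 0..3 */
            uint32x4_t hi  = vld1q_u32(sums + x + 4); /* running sums, cols 4..7 */
            lo = vaddw_u16(lo, vget_low_u16(w16));    /* widen to 32 bits + add  */
            hi = vaddw_u16(hi, vget_high_u16(w16));
            vst1q_u32(sums + x, lo);
            vst1q_u32(sums + x + 4, hi);
        }
#endif
        /* Scalar tail for leftover columns; full fallback when NEON is absent. */
        for (; x < width; ++x)
            sums[x] += row[x];
    }
}
```

For a 640 x 480 image you would call column_sums(img, 640, 480, 640, sums) with sums pointing at 640 uint32_t values. A column of 480 pixels that are all 255 sums to 122400, which no longer fits in 16 bits, so 32-bit accumulators are a reasonable choice here. If you only need the four widened values themselves, vmovl_u16(vget_low_u16(w16)) gives the first four pixels as a uint32x4_t.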