I have a single uint8x8_t, e.g. [100, 100, 100, 100, 200, 200, 200, 200].
How can that uint8x8_t be copied into all four registers of one uint8x8x4_t with a single instruction / intrinsic?
At the moment we assign each register separately:
uint8x8x4_t out;      // in is the uint8x8_t to replicate
out.val[0] = in;
out.val[1] = in;
out.val[2] = in;
out.val[3] = in;
// typedef struct uint8x8x4_t {
// uint8x8_t val[4];
// } uint8x8x4_t;
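I don't think there is a single NEON instruction which does this, unless you replicate the input data in memory and then load all four registers with a single vld4(). Note that vld4 de-interleaves as it loads, so each source byte would have to be stored four times in a row for every register to end up with the same contents.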
I've not tested it, but my gut feeling is that replicating in memory is probably not going to be an overall saving: I doubt many CPU caches can sustain 64 bytes per clock, and the register moves needed to make the copies should be efficient.
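
To make the comparison concrete, here is a rough, unbenchmarked sketch of both options. The helper names replicate_regs, replicate_via_vld4 and copies_match are made up for illustration; only vld1_u8, vst1_u8 and vld4_u8 are real intrinsics from arm_neon.h. The #else branch is a plain-C stand-in (an assumption, not real NEON) so the sketch also compiles on a non-ARM host.

```c
#include <stdint.h>
#include <string.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#else
/* Portable stand-ins so the sketch builds off ARM; these are NOT real NEON. */
typedef struct { uint8_t lane[8]; } uint8x8_t;
typedef struct { uint8x8_t val[4]; } uint8x8x4_t;
static inline uint8x8_t vld1_u8(const uint8_t *p)
{
    uint8x8_t v;
    memcpy(v.lane, p, 8);
    return v;
}
static inline void vst1_u8(uint8_t *p, uint8x8_t v)
{
    memcpy(p, v.lane, 8);
}
/* vld4 de-interleaves: byte 4*i + j of memory goes to lane i of val[j]. */
static inline uint8x8x4_t vld4_u8(const uint8_t *p)
{
    uint8x8x4_t r;
    for (int i = 0; i < 8; ++i)
        for (int j = 0; j < 4; ++j)
            r.val[j].lane[i] = p[4 * i + j];
    return r;
}
#endif

/* Option 1 (hypothetical helper): copy the register four times.
 * The compiler is free to turn this into plain register moves. */
static inline uint8x8x4_t replicate_regs(uint8x8_t v)
{
    uint8x8x4_t r = { { v, v, v, v } };
    return r;
}

/* Option 2 (hypothetical helper): stage the data in memory, then use a
 * single vld4. Each source byte is written four times in a row so that
 * the de-interleaving load puts the same 8 bytes in every register. */
static inline uint8x8x4_t replicate_via_vld4(const uint8_t src[8])
{
    uint8_t buf[32];
    for (int i = 0; i < 8; ++i)
        for (int j = 0; j < 4; ++j)
            buf[4 * i + j] = src[i];
    return vld4_u8(buf);
}

/* Returns 1 if all four registers of r hold exactly the bytes in expect. */
static inline int copies_match(uint8x8x4_t r, const uint8_t expect[8])
{
    for (int j = 0; j < 4; ++j) {
        uint8_t out[8];
        vst1_u8(out, r.val[j]);
        if (memcmp(out, expect, 8) != 0)
            return 0;
    }
    return 1;
}
```

Option 2 only issues one load, but it needs 32 bytes of staging writes first, which is why I would expect the four register copies in option 1 to be at least as cheap in practice.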