I want to load __m256 directly from Armadillo vector data with .memptr().
Does Armadillo ensure the data memory is 256-bit aligned? If it does, I would just cast the float/double pointer returned by .memptr() to a __m256 pointer and skip the _mm256_load_ps(), if that makes sense in terms of performance.
The Armadillo documentation does not seem to address this point, so the alignment is left unspecified. Thus, vector data is likely not guaranteed to be 32-byte aligned.
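If you want to see what alignment you actually get on your platform, you can test the pointer at run time. The sketch below is only an illustration: is_aligned is a hypothetical helper name, not an Armadillo function, and an aligned result for one vector does not mean every vector will be aligned.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of Armadillo): returns true when the address
// held by p is a multiple of `alignment` bytes. Pass 32 for AVX (256 bits).
inline bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

With an Armadillo vector v you would call it as is_aligned(v.memptr(), 32).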
However, you do not need the vector data to be aligned to load it into AVX registers: you can use the unaligned load intrinsic _mm256_loadu_ps. AFAIK, the performance of _mm256_load_ps and _mm256_loadu_ps is about the same on relatively new x86 processors.
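As a rough sketch of the unaligned-load approach, here is a float sum that works on a pointer with no particular alignment. sum_floats is a hypothetical name chosen for illustration. The AVX path is only compiled when the compiler defines __AVX__ (e.g. with -mavx on GCC/Clang); otherwise the function falls back to a plain scalar loop.

```cpp
#include <cstddef>

#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical helper: sums n floats starting at p.
// p does not need to be 32-byte aligned.
float sum_floats(const float* p, std::size_t n) {
    float total = 0.0f;
    std::size_t i = 0;
#if defined(__AVX__)
    __m256 acc = _mm256_setzero_ps();
    // Process 8 floats per iteration using the unaligned load intrinsic.
    for (; i + 8 <= n; i += 8) {
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(p + i));
    }
    // Spill the 8 partial sums and reduce them in scalar code.
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    for (float v : lanes) {
        total += v;
    }
#endif
    // Scalar tail (or the whole array when AVX is not enabled).
    for (; i < n; ++i) {
        total += p[i];
    }
    return total;
}
```

For an arma::fvec v, the call would be sum_floats(v.memptr(), v.n_elem). Note that the summation order differs from a simple left-to-right loop, so results can differ slightly in the last bits for general floating-point data.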