Basically, I want to apply an AND mask to an array of bytes. I know the code should look something like this:
char *arr = (char*)_mm_malloc(num_bytes, 8);
// fill the array with some values
__m256i mask = _mm256_set1_epi8(0x12);
for (uint32_t i = 0; i < num_bytes; i += 32) {
    // the load intrinsic for chars is what I don't know
    __m256i val = _mm256_load_char(arr + i);  // placeholder, not a real intrinsic
    val = _mm256_and_si256(val, mask);
    // perform extra operations with the result
}
But I don't know how to safely load a packet of 32 bytes into a 256-bit register.
The intrinsic for vmovdqu ymm, [mem] is _mm256_loadu_si256((const __m256i*)any_pointer);
e.g. _mm256_loadu_si256((const __m256i*)(arr + i))
The aligned-load intrinsic (for vmovdqa ymm, [mem]) is _mm256_load_si256().
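Putting it together, here's a minimal sketch of your loop using the unaligned load (assuming num_bytes is a multiple of 32; a leftover tail would need separate handling, and the store-back at the end is just illustrative):

#include <immintrin.h>
#include <stdint.h>

void and_mask(char *arr, uint32_t num_bytes)
{
    __m256i mask = _mm256_set1_epi8(0x12);
    for (uint32_t i = 0; i < num_bytes; i += 32) {
        // vmovdqu: safe for any alignment
        __m256i val = _mm256_loadu_si256((const __m256i*)(arr + i));
        val = _mm256_and_si256(val, mask);
        // extra operations with the result; e.g. store it back:
        _mm256_storeu_si256((__m256i*)(arr + i), val);
    }
}

Compile with -mavx2 (GCC/Clang) or /arch:AVX2 (MSVC).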
See Intel's intrinsics guide (https://software.intel.com/sites/landingpage/IntrinsicsGuide/) or any other intrinsics reference, where you can look up things like this.
If you're allocating the memory on the spot with _mm_malloc, ask for 32-byte alignment, not just 8, so you can use aligned loads and be guaranteed never to have a cache-line split.
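For example (note that memory from _mm_malloc must be released with _mm_free, not free()):

char *arr = (char*)_mm_malloc(num_bytes, 32);   // 32-byte aligned, good for AVX
// ... fill and process the array ...
__m256i val = _mm256_load_si256((const __m256i*)(arr + i));  // vmovdqa: faults if misaligned
// ...
_mm_free(arr);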
Intel's integer load/store intrinsics have silly prototypes that require casting the pointer to __m256i*
even if it's not guaranteed to be properly aligned. Compilers that implement Intel's intrinsics are required to handle this without any undefined behaviour.
(In ISO C++ even creating an unaligned pointer without dereferencing it is UB.)
The AVX512 load/store intrinsics finally make this sane: they take void*, so you don't need all those noisy / ugly casts.
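For example, the AVX512 version of the load needs no cast (a sketch assuming AVX512F; compile with -mavx512f on GCC/Clang):

__m512i val = _mm512_loadu_si512(arr + i);                    // takes void const*, no cast
val = _mm512_and_si512(val, _mm512_set1_epi32(0x12121212));   // 0x12 in every byte
_mm512_storeu_si512(arr + i, val);                            // takes void*, no cast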