Basically, I want to apply an AND mask to an array of bytes. I know the code should look something like this:
char *arr = (char*)_mm_malloc(num_bytes, 8);
// fill the array with some values
__m256i mask = _mm256_set1_epi8(0x12);
for (uint32_t i = 0; i < num_bytes; i += 32) {
    // the load intrinsic for chars is what I don't know
    __m256i val = _mm256_load_char(arr + i);  // placeholder, not a real intrinsic
    val = _mm256_and_si256(val, mask);
    // perform extra operations with the result
}
But I don't know how to safely load a packet of 32 bytes into a 256-bit register.
The intrinsic for vmovdqu ymm, [mem] is _mm256_loadu_si256((const __m256i*)any_pointer);
e.g. _mm256_loadu_si256((const __m256i*)(arr + i))
The aligned-load intrinsic (for vmovdqa ymm, [mem]) is _mm256_load_si256().
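Putting it together, here's a minimal sketch of your loop using the unaligned load (assuming num_bytes is a multiple of 32; a leftover tail would need separate handling, and the store-back at the end is just illustrative):

#include <immintrin.h>
#include <stdint.h>

void and_mask(char *arr, uint32_t num_bytes)
{
    __m256i mask = _mm256_set1_epi8(0x12);
    for (uint32_t i = 0; i < num_bytes; i += 32) {
        // vmovdqu: safe for any alignment
        __m256i val = _mm256_loadu_si256((const __m256i*)(arr + i));
        val = _mm256_and_si256(val, mask);
        // extra operations with the result; e.g. store it back:
        _mm256_storeu_si256((__m256i*)(arr + i), val);
    }
}

Compile with -mavx2 (GCC/Clang) or /arch:AVX2 (MSVC).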
See Intel's intrinsics guide (https://software.intel.com/sites/landingpage/IntrinsicsGuide/) or any other intrinsics reference, where you can look up things like this.
If you're allocating the memory on the spot with _mm_malloc, ask for 32-byte alignment, not just 8, so you can use aligned loads and be guaranteed never to have a cache-line split.
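For example (note that memory from _mm_malloc must be released with _mm_free, not free()):

char *arr = (char*)_mm_malloc(num_bytes, 32);   // 32-byte aligned, good for AVX
// ... fill and process the array ...
__m256i val = _mm256_load_si256((const __m256i*)(arr + i));  // vmovdqa: faults if misaligned
// ...
_mm_free(arr);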
Intel's integer load/store intrinsics have silly prototypes that require casting the pointer to __m256i*
even if it's not guaranteed to be properly aligned. Compilers that implement Intel's intrinsics are required to handle this without any undefined behaviour.
(In ISO C++ even creating an unaligned pointer without dereferencing it is UB.)
The AVX512 load/store intrinsics finally make this sane: they take void*, so you don't need all those noisy / ugly casts.
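For example, the AVX512 version of the load needs no cast (a sketch assuming AVX512F; compile with -mavx512f on GCC/Clang):

__m512i val = _mm512_loadu_si512(arr + i);                    // takes void const*, no cast
val = _mm512_and_si512(val, _mm512_set1_epi32(0x12121212));   // 0x12 in every byte
_mm512_storeu_si512(arr + i, val);                            // takes void*, no cast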