Tags: c++, iterator, sse, simd, intrinsics

SIMD instructions on contiguous iterators


I have two vectors v1 and v2 with element type T and want to write a function that computes the element-wise v1 & v2 using SIMD instructions and stores the result in a vector out.

Ideally, what we would have is something like this:

auto first1  = v1.begin();
auto last1   = v1.end();
auto first2  = v2.begin();
auto d_first = out.begin();
while (std::distance(first1, last1) >= 64 / sizeof(T)) {
    // This is the line that doesn't compile: iterators don't convert to __m512i.
    *d_first = _mm512_and_epi32(first1, first2);
    first1  += 64 / sizeof(T);
    first2  += 64 / sizeof(T);
    d_first += 64 / sizeof(T);
}
// Scalar cleanup for the leftover elements.
auto and_op = [](T a, T b) { return a & b; };
std::transform(first1, last1, first2, d_first, and_op);

The first issue with the code above is that _mm512_and_epi32 works on 32-bit integers. I'm not sure whether it expects them to be aligned, and if it does, the code wouldn't work when T is something like char or short int.

The second issue is that I can't get the vector iterators to convert correctly. _mm512_and_epi32 expects two __m512i values as input. Whenever I pass a contiguous iterator or an address, the compiler complains that there's no conversion from what I pass to "'__m512i' (vector of 8 'long long' values)".

I am able to get it to work by doing

__m512i _a = _mm512_load_epi64(&*first1.base());
__m512i _b = _mm512_load_epi64(&*first2.base());
__m512i _res = _mm512_and_epi64(_a, _b);
_mm512_store_epi64(&*d_first.base(), _res);

But I'm not sure how costly the load/store operations are or whether or not I can skip them.

What is the proper way to run SIMD instructions on large contiguous arrays? Is there a way to make it work for contiguous arrays of all types, regardless of their alignment?


Solution

  • Normally you just get a pointer from .data() on the container and loop manually over the array, like a C-style array. Or increment an index and do _mm512_loadu_si512(&vec[i]). (Unless you used a custom aligned allocator for your std::vector, you shouldn't assume the data is aligned. But 512-bit vectors on current hardware benefit significantly from making sure data is aligned: maybe a 20% speedup, vs. only a couple percent with 256-bit vectors.) Sketches of both the manual loop and an aligned allocator follow below.

    Your dereferenced iterator way might be safe if there's a guarantee that it's a reference to the underlying array element, not a scalar temporary.

    Load/store intrinsics aren't any more costly than the implicit loads you get by dereferencing something; you need to think from an asm perspective to understand the costs. The compiler has to emit vector load instructions (or a memory source operand for an ALU instruction), and store instructions, to make asm that operates on data in memory. _mm_load_si128 vs. _mm_loadu_si128 basically just exists to communicate alignment information to the compiler and to cast; like memcpy, it also expresses a strict-aliasing- and alignment-safe access to memory holding other C types.
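
    For concreteness, here is a minimal sketch of that pattern, assuming AVX-512F (compile with -mavx512f), an integral element type T, and an out vector already sized to match. The function name bitwise_and_simd and the choice of unaligned loads are mine, for illustration, not from the original answer:

#include <immintrin.h>

#include <algorithm>
#include <cstddef>
#include <vector>

// Sketch: element-wise v1 & v2 into out using 512-bit unaligned loads/stores,
// with a scalar std::transform tail for the leftover elements.
// Assumes out.size() >= v1.size() == v2.size() and an integral T.
template <typename T>
void bitwise_and_simd(const std::vector<T>& v1, const std::vector<T>& v2,
                      std::vector<T>& out)
{
    const std::size_t n    = v1.size();
    const std::size_t step = 64 / sizeof(T);  // elements per 512-bit vector
    const T* a = v1.data();
    const T* b = v2.data();
    T*       d = out.data();

    std::size_t i = 0;
    for (; i + step <= n; i += step) {
        // loadu/storeu tolerate any alignment. AND is bit-granular, so
        // _mm512_and_si512 works the same for char, short, int, or long long.
        __m512i va  = _mm512_loadu_si512(a + i);
        __m512i vb  = _mm512_loadu_si512(b + i);
        __m512i res = _mm512_and_si512(va, vb);
        _mm512_storeu_si512(d + i, res);
    }

    std::transform(a + i, a + n, b + i, d + i,
                   [](T x, T y) { return x & y; });
}

    Because the AND is bitwise, the element width of T only changes the step count, which answers the "all types" part of the question.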
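
    If you do control the allocation, a custom aligned allocator (mentioned above) guarantees the vector's buffer starts on a 64-byte boundary, making the alignment-required _mm512_load_si512 / _mm512_store_si512 forms safe on it. A minimal C++17 sketch, again an illustration rather than a hardened implementation:

#include <cstddef>
#include <new>
#include <vector>

// Minimal C++17 allocator that over-aligns std::vector storage to 64 bytes.
template <typename T>
struct aligned64_allocator {
    using value_type = T;

    aligned64_allocator() = default;
    template <typename U>
    aligned64_allocator(const aligned64_allocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        // C++17 aligned operator new; throws std::bad_alloc on failure.
        return static_cast<T*>(
            ::operator new(n * sizeof(T), std::align_val_t{64}));
    }
    void deallocate(T* p, std::size_t) noexcept {
        ::operator delete(p, std::align_val_t{64});
    }
};

template <typename T, typename U>
bool operator==(const aligned64_allocator<T>&,
                const aligned64_allocator<U>&) noexcept { return true; }
template <typename T, typename U>
bool operator!=(const aligned64_allocator<T>&,
                const aligned64_allocator<U>&) noexcept { return false; }

// Usage: data() of this vector is 64-byte aligned, so aligned 512-bit
// loads/stores at 64-byte offsets into it won't fault.
using aligned_ivec = std::vector<int, aligned64_allocator<int>>;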