c++ openmp vectorization auto-vectorization

How to determine vector length to ensure no vector dependency during vectorization?


For cases like this example:

for (int i = 16; i < n; i++) 
    a[i] += a[i-16];

How do I determine the vector length to be sure this loop can be vectorized? Is the following the correct method?

// Determine the target CPU architecture's vector register size in bits
// E.g., Intel AVX-512 has 512-bit vector registers
int register_size = 512;

// A byte is 8 bits on modern machines
int byte_size = 8;

// Determine the array element size in bits
int my_array[n];
auto array_type_size = sizeof(int) * byte_size;   // 4 bytes * 8 bits = 32 bits

// Divide the register size by the element size
auto vector_length = register_size / array_type_size;   // = 16

This would mean that vectorizing an array of type int on an Intel AVX-512 machine puts 16 elements in each vector register, making the above example safe to vectorize.

Is this method correct? If so, is there a way to use shorter vectors on this architecture? For example, can I force a vector length of 4 so that the example below can be vectorized?

for (int i = 4; i < n; i++) 
    a[i] += a[i-4];

Solution

  • Intel provides a range of SIMD register widths, from the 64-bit MMX registers up to the 512-bit ZMM registers. Narrower registers remain usable on a wider machine: the 256-bit YMM registers alias the lower half of the ZMM registers, and the 128-bit XMM registers alias the lower half of YMM. With optimisations enabled (a release build), your compiler will vectorise a loop whenever it can show that doing so is safe, and it chooses the vector width itself; it does not have to fill the whole register. If you want to control this yourself, you can edit compiler flags or write your own assembly, which can give a small but noticeable speed increase. To “force” the compiler to vectorise a certain way, change the compiler flags to disallow use of AVX-512, or use the pragmas (compiler directives) described in your compiler's documentation. There is no inherent problem with using only part of a register when vectorising.
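Since the question is tagged openmp, one directive that fits here is OpenMP's `simd` construct with a `safelen` clause. A loop whose dependency distance is 4 stays correct as long as no more than 4 consecutive iterations are processed together, and `safelen(4)` states that limit to the compiler. The following is a minimal sketch, not a definitive implementation: the function names `accumulate_scalar` and `accumulate_simd4` are hypothetical, and it assumes a compiler that honours the pragma (for GCC and Clang, `-fopenmp-simd` or `-fopenmp`). Without such a flag the pragma is ignored and the loop simply runs as ordinary code.

```cpp
#include <cstddef>
#include <vector>

// Scalar reference: each element depends on the one `distance` positions earlier.
void accumulate_scalar(std::vector<int>& a, std::size_t distance) {
    for (std::size_t i = distance; i < a.size(); ++i)
        a[i] += a[i - distance];
}

// Same loop with a fixed dependency distance of 4.
// safelen(4) tells the compiler that at most 4 consecutive iterations
// may be executed together, which matches the distance of the dependency.
void accumulate_simd4(std::vector<int>& a) {
    int* p = a.data();
    const std::size_t n = a.size();
    #pragma omp simd safelen(4)
    for (std::size_t i = 4; i < n; ++i)
        p[i] += p[i - 4];
}
```

With 32-bit `int`, four elements occupy 128 bits, so a compiler that honours the clause would be expected to use XMM-width operations at most for this loop, even on an AVX-512 machine. A coarser alternative on GCC and Clang is the `-mprefer-vector-width=128` flag, which expresses a preference for the whole translation unit rather than a guarantee for one loop.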