c compiler-construction clang simd auto-vectorization

How does this vectorized code not overwrite memory?

Take this code.

#include <stdlib.h>

int main(int argc , char **argv) {
    int *x = malloc(argc*sizeof(int));
    
    for (int i = 0; i < argc; ++i) {
        x[i] = argc;
    }

    int t = 0;
    for (int i = 0; i < argc; ++i) {
        t += x[i];
    }

    free(x);
    return t;
}

This loop

for (int i = 0; i < argc; ++i) {
    x[i] = argc;
}

Vectorizes to

        movdqu  xmmword ptr [rax + 4*rdx], xmm0
        movdqu  xmmword ptr [rax + 4*rdx + 16], xmm0
        movdqu  xmmword ptr [rax + 4*rdx + 32], xmm0
        movdqu  xmmword ptr [rax + 4*rdx + 48], xmm0
        movdqu  xmmword ptr [rax + 4*rdx + 64], xmm0
        movdqu  xmmword ptr [rax + 4*rdx + 80], xmm0
        movdqu  xmmword ptr [rax + 4*rdx + 96], xmm0
        movdqu  xmmword ptr [rax + 4*rdx + 112], xmm0
        movdqu  xmmword ptr [rax + 4*rdx + 128], xmm0
        movdqu  xmmword ptr [rax + 4*rdx + 144], xmm0
        movdqu  xmmword ptr [rax + 4*rdx + 160], xmm0
        movdqu  xmmword ptr [rax + 4*rdx + 176], xmm0
        movdqu  xmmword ptr [rax + 4*rdx + 192], xmm0
        movdqu  xmmword ptr [rax + 4*rdx + 208], xmm0
        movdqu  xmmword ptr [rax + 4*rdx + 224], xmm0
        movdqu  xmmword ptr [rax + 4*rdx + 240], xmm0

https://godbolt.org/z/33vvonojd

The way I read this, it's storing in blocks of 256 bytes (16 stores of 16 bytes each). How is this possible, considering that my malloc of argc*sizeof(int) bytes is nowhere near that big? Wouldn't that write past the end of the memory that I malloc'ed?

Solution

It would if it ran. That's why it doesn't run for small argc.

Note all the conditional branches before reaching that big (overly aggressively unrolled) block, specifically jmp .LBB0_12 in the fall-through path from cmp ebx, 7 / ja .LBB0_4.

Also note the smaller loop at .LBB0_10 that's unrolled by 2 vectors. (Seems unwise not to have a 16-byte rolled-up loop at all, only 256, 32, and scalar, but that's what clang did.)

Having some logic to decide whether to run the auto-vectorized version of a loop is 100% standard, and necessary whenever it can't be proved at compile time that the loop will run for at least one full vector. Finding the biggest block of code is all you need to do to see what's probably going to happen for large inputs, but if you want to check correctness you have to consider which loops might run 0 iterations.
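As a rough illustration of that guarded structure, here is a C sketch of the same shape: a 256-byte block loop, a 32-byte block loop, and a scalar cleanup loop, each of which may run 0 iterations. The names fill_guarded and store_block are hypothetical, and the block sizes (64 and 8 ints) are taken from the 256/32-byte observation above. This is not a transcription of clang's actual branch logic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: stands in for one unrolled group of vector stores.
   Writes `value` to exactly `count` consecutive ints starting at dst. */
static void store_block(int *dst, size_t count, int value) {
    for (size_t k = 0; k < count; ++k)
        dst[k] = value;
}

/* Sketch only: every loop is guarded by how many elements remain,
   so a block store never starts unless the whole block fits. */
void fill_guarded(int *x, size_t n, int value) {
    assert(x != NULL || n == 0);
    size_t i = 0;

    /* Big unrolled loop: 64 ints = 256 bytes per iteration.
       Runs 0 times when fewer than 64 elements remain. */
    for (; n - i >= 64; i += 64)
        store_block(x + i, 64, value);

    /* Smaller loop: 8 ints = 32 bytes (two 16-byte vectors) per iteration. */
    for (; n - i >= 8; i += 8)
        store_block(x + i, 8, value);

    /* Scalar cleanup for the last 0..7 elements. */
    for (; i < n; ++i)
        x[i] = value;
}
```

With n = 3, only the scalar loop does any work. With n = 75, the 256-byte loop runs once, the 32-byte loop runs once, and the scalar loop handles the last 3 elements. In neither case is anything written past x[n - 1].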