Take this code.
#include <stdlib.h>

int main(int argc, char **argv) {
    int *x = malloc(argc * sizeof(int));
    for (int i = 0; i < argc; ++i) {
        x[i] = argc;
    }
    int t = 0;
    for (int i = 0; i < argc; ++i) {
        t += x[i];
    }
    free(x);
    return t;
}
This loop

for (int i = 0; i < argc; ++i) {
    x[i] = argc;
}

vectorizes to:
movdqu xmmword ptr [rax + 4*rdx], xmm0
movdqu xmmword ptr [rax + 4*rdx + 16], xmm0
movdqu xmmword ptr [rax + 4*rdx + 32], xmm0
movdqu xmmword ptr [rax + 4*rdx + 48], xmm0
movdqu xmmword ptr [rax + 4*rdx + 64], xmm0
movdqu xmmword ptr [rax + 4*rdx + 80], xmm0
movdqu xmmword ptr [rax + 4*rdx + 96], xmm0
movdqu xmmword ptr [rax + 4*rdx + 112], xmm0
movdqu xmmword ptr [rax + 4*rdx + 128], xmm0
movdqu xmmword ptr [rax + 4*rdx + 144], xmm0
movdqu xmmword ptr [rax + 4*rdx + 160], xmm0
movdqu xmmword ptr [rax + 4*rdx + 176], xmm0
movdqu xmmword ptr [rax + 4*rdx + 192], xmm0
movdqu xmmword ptr [rax + 4*rdx + 208], xmm0
movdqu xmmword ptr [rax + 4*rdx + 224], xmm0
movdqu xmmword ptr [rax + 4*rdx + 240], xmm0
https://godbolt.org/z/33vvonojd
The way I read this, it's vectorizing in memory blocks of 256 bytes. How is this possible, considering that my malloc, being of size argc*sizeof(int), is nowhere near that big? Wouldn't that write past the memory that I malloced?
It would if it ran; that's why it doesn't run for small argc.
Note all the conditional branches before reaching that big (overly aggressively) unrolled block, specifically jmp .LBB0_12 in the fall-through path from cmp ebx, 7 / ja .LBB0_4.
Also note the smaller loop at .LBB0_10 that's unrolled by 2 vectors. (Seems unwise not to have a 16-byte rolled-up loop at all, only 256, 32, and scalar, but that's what clang did.)
Having some logic to decide whether to run an auto-vectorized version of a loop is 100% standard, and necessary when it can't be proved at compile time that the loop will run for even one full vector. Finding the biggest block of code is all you need to do to see what's probably going to happen for large inputs, but if you want to check correctness, you have to consider which loops might run 0 iterations.
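To make the control flow concrete, here is a rough C sketch of the shape of that asm: a guard, a 256-byte block loop, a 32-byte loop, and a scalar cleanup loop. The function names are hypothetical. The block sizes (64 ints and 8 ints) come from the 256 and 32 byte figures above. Only the "more than 7 elements" guard is taken from the cmp ebx, 7 / ja pair; the other loop conditions are my assumption about how the pieces fit together, not a transcription of clang's output.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a run of movdqu stores:
   writes `count` ints starting at dst. */
static void store_block(int *dst, size_t count, int value) {
    for (size_t k = 0; k < count; ++k) {
        dst[k] = value;
    }
}

/* Fills x[0..n) the way the vectorized loop is laid out.
   Returns how many 256-byte blocks were stored, so a caller
   can see that small n never reaches the big unrolled block. */
size_t fill_like_vectorizer(int *x, size_t n, int value) {
    size_t i = 0;
    size_t big_blocks = 0;

    if (n > 7) {                      /* like cmp ebx, 7 / ja .LBB0_4 */
        /* Big unrolled block: 16 stores of 16 bytes = 64 ints. */
        for (; n - i >= 64; i += 64) {
            assert(i + 64 <= n);      /* the whole block fits in the buffer */
            store_block(x + i, 64, value);
            ++big_blocks;
        }
        /* Smaller loop, unrolled by 2 vectors: 32 bytes = 8 ints. */
        for (; n - i >= 8; i += 8) {
            assert(i + 8 <= n);
            store_block(x + i, 8, value);
        }
    }

    /* Scalar cleanup; also the only loop that runs when n <= 7. */
    for (; i < n; ++i) {
        x[i] = value;
    }
    return big_blocks;
}
```

With n = 3, both block loops run 0 iterations and only the scalar loop stores anything. With n = 100, one 64-int block runs, then four 8-int blocks, then four scalar stores, and nothing is written at index 100 or beyond.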