I am trying to vectorize some simple calculations to get a speed-up from the SIMD architecture. I also want to put them in inline functions, because function calls and non-vectorized code also cost computation time. However, I cannot always achieve both at the same time. In fact, most of my inline functions fail to get auto-vectorized. Here is a simple test code that works:
inline void add1(double *v, int Length) {
    for(int i=0; i < Length; i++) v[i] += 1;
}
void call_add1(double v[], int L) {
    add1(v, L);
}
int main(){return 0;}
On Mac OS X 10.12.3, I compile it with:
clang++ -O3 -Rpass=loop-vectorize -Rpass-analysis=loop-vectorize -std=c++11 -ffast-math test.cpp
test.cpp:2:5: remark: vectorized loop (vectorization width: 2, interleaved count: 2) [-Rpass=loop-vectorize]
    for(int i=0; i < Length; i++) v[i] += 1;
    ^
However, something very similar (the only change is that the arguments of add1 are now local variables inside call_add1) does not work:
inline void add1(double *v, int Length) {
    for(int i=0; i < Length; i++) v[i] += 1;
}
void call_add1() {
    double v[20]={0,1,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9};
    int L=20;
    add1(v, L);
}
int main(){ return 0;}
Compiling with the same command produces no output. Why does this happen? How can I make sure that loops in inline functions always get auto-vectorized? I want to vectorize many loops in functions, so I hope the fix is not too complex.
Compiling your code with -fsave-optimization-record shows that the loop was completely unrolled and the resulting loads were then eliminated:
--- !Passed
Pass: loop-unroll
Name: FullyUnrolled
DebugLoc: { File: main.cpp, Line: 2, Column: 5 }
Function: _Z9call_add1v
Args:
- String: 'completely unrolled loop with '
- UnrollCount: '20'
- String: ' iterations'
...
--- !Passed
Pass: gvn
Name: LoadElim
DebugLoc: { File: main.cpp, Line: 2, Column: 40 }
Function: _Z9call_add1v
Args:
- String: 'load of type '
- Type: double
- String: ' eliminated'
- String: ' in favor of '
- InfavorOfValue: '0.000000e+00'
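The GVN remark shows a load of a double being replaced by the constant 0.0, which is the first element of the array. Because both the size (20) and the contents of v are known at compile time, clang can work out the final values without running a loop at all, so there is nothing left for the vectorizer to handle. As an illustration only (this is not what clang literally emits), the same computation can be forced to happen at compile time with a constexpr variant; Arr20 and add1_const are hypothetical names, and the sketch needs C++14 or later because it uses a loop inside a constexpr function.

```cpp
// Illustration: with a compile-time-known size and contents, the whole
// add1 computation can be evaluated by the compiler, leaving no runtime
// loop to vectorize. Requires C++14 (loops in constexpr functions).

// Hypothetical wrapper so the array can be passed and returned by value.
struct Arr20 {
    double v[20];
};

// Same loop body as add1, but constexpr so it can run at compile time.
constexpr Arr20 add1_const(Arr20 a) {
    for (int i = 0; i < 20; i++) a.v[i] += 1;
    return a;
}

// Same initializer as in the question; the 20th element is zero-initialized.
constexpr Arr20 input = {{0,1,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9}};

// Computed entirely by the compiler.
constexpr Arr20 result = add1_const(input);

static_assert(result.v[0] == 1.0, "v[0] started at 0");
static_assert(result.v[19] == 1.0, "v[19] started at 0 (zero-initialized)");
```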
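So the missing remark does not mean the second version is slower: by the time the vectorizer runs, the loop no longer exists. If you put 4000 elements into the array, the loop exceeds the optimizer's full-unrolling threshold, so clang keeps it as a loop and vectorizes it.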
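A sketch of the larger case, assuming the array's contents are made observable by returning their sum (call_add1_big and the fill pattern are hypothetical; the question's version never reads the array back, which also lets the optimizer discard the work). With 4000 elements, full unrolling is no longer attractive, so the loop in add1 should survive until the vectorizer runs.

```cpp
// Same add1 as in the question.
inline void add1(double *v, int Length) {
    for (int i = 0; i < Length; i++) v[i] += 1;
}

// Hypothetical variant of call_add1: 4000 elements instead of 20, and the
// result is returned so the computation cannot be removed as dead code.
double call_add1_big() {
    const int L = 4000;
    double v[L];
    for (int i = 0; i < L; i++) v[i] = i % 10;  // arbitrary fill: 0,1,...,9 repeating
    add1(v, L);
    double sum = 0;
    for (int i = 0; i < L; i++) sum += v[i];
    return sum;  // 400 blocks * (45 + 10) = 22000
}
```

Compiling this with the question's command (clang++ -O3 -Rpass=loop-vectorize ...) and checking for a remark on the loop in add1 is a quick way to confirm the behavior; the exact remarks, and whether the fill and sum loops are vectorized as well, may differ between clang versions.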