Is it better, in some sense, to vectorize code by hand using explicit pragmas, or to rely on auto-vectorization? For optimum performance with auto-vectorization, one would have to monitor the compiler output to ensure that loops are being vectorized, or modify them until they are vectorizable.
With hand coding, one is certain that the desired instructions are being emitted, but now the code is likely not portable (either to other architectures or other compilers).
Auto-vectorization has never worked out well for me. At the moment it seems to handle only very trivial loops.
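For illustration, this is the kind of trivial loop I mean, one that auto-vectorizers usually do handle: no aliasing thanks to `restrict`, no loop-carried dependencies, one independent operation per element (the function name and the constant factor are made up for the example).

```c
#include <stddef.h>

/* A made-up example of a loop that auto-vectorizers handle well:
   restrict rules out aliasing, there are no loop-carried dependencies,
   and each iteration is a single independent operation. */
void scale(float *restrict dst, const float *restrict src, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i] * 2.0f;
}
```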
I use the pragma/intrinsic approach and take a look at the assembly. If the compiler generates bad code (like spilling SSE registers onto the stack or adding redundant moves), I use inline assembler for the whole loop body.
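As a rough sketch of the intrinsic approach (the function name and the alignment assumptions are mine, not anything from a real project), an SSE version of the same element-wise loop might look like this; inspecting the generated assembly then shows whether the compiler kept the values in registers:

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical SSE version: processes four floats per iteration.
   Assumes n is a multiple of 4 and both pointers are 16-byte aligned. */
void scale_sse(float *dst, const float *src, size_t n)
{
    const __m128 factor = _mm_set1_ps(2.0f);
    for (size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_load_ps(src + i);              /* aligned load  */
        _mm_store_ps(dst + i, _mm_mul_ps(v, factor)); /* aligned store */
    }
}
```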
Portability, by the way, is not a problem. Often you start with a C/C++ loop and optimize it using intrinsics. Just keep the old loop and use it as a unit test / fallback for your SIMD implementation. Also, it's always wise to be able to remove all SIMD code from a project via a compile-time define; debugging an application is much easier that way, and the same define can be used for cross-compilation.
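A minimal sketch of that pattern (the USE_SSE define and the function names are invented for the example): the plain C loop stays in the source as the reference and fallback path, and the define selects the intrinsics path.

```c
#include <stddef.h>

/* scale_sse is the hypothetical intrinsics version from the sketch above. */
void scale_sse(float *dst, const float *src, size_t n);

/* Compile-time switch: defining USE_SSE (a made-up name) selects the SIMD
   path; otherwise the portable loop runs. The scalar version also serves
   as the expected result when unit-testing the SIMD code. */
void scale_dispatch(float *dst, const float *src, size_t n)
{
#ifdef USE_SSE
    scale_sse(dst, src, n);
#else
    for (size_t i = 0; i < n; ++i)  /* portable reference loop */
        dst[i] = src[i] * 2.0f;
#endif
}
```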