Search code examples
cgccssecomplex-numbersauto-vectorization

How to enable SSE3 addsubps autovectorization for complex numbers in gcc?


I have a simple loop with takes the product of n complex numbers. As I perform this loop millions of times I want it to be as fast as possible. I understand that it's possible to do this quickly using SSE3 and gcc intrinsics like _mm_addsub_ps but I'm interested in whether it's possible to get gcc to auto-vectorize code like this, a product of complex numbers:

#include <complex.h>
complex float f(complex float x[], int n ) {
  complex float p = 1.0;
  for (int i = 0; i < n; i++)
    p *= x[i];
  return p;
}

The assembly you get from gcc -S -O3 -ffast-math is:

        .file   "test.c"
        .section        .text.unlikely,"ax",@progbits
.LCOLDB2:
        .text
.LHOTB2:
        .p2align 4,,15
        .globl  f
        .type   f, @function
f:
.LFB0:
        .cfi_startproc
        testl   %esi, %esi
        jle     .L4
        leal    -1(%rsi), %eax
        pxor    %xmm2, %xmm2
        movss   .LC1(%rip), %xmm3
        leaq    8(%rdi,%rax,8), %rax
        .p2align 4,,10
        .p2align 3
.L3:
        movaps  %xmm3, %xmm5
        movaps  %xmm3, %xmm4
        movss   (%rdi), %xmm0
        addq    $8, %rdi
        movss   -4(%rdi), %xmm1
        mulss   %xmm0, %xmm5
        mulss   %xmm1, %xmm4
        cmpq    %rdi, %rax
        mulss   %xmm2, %xmm0
        mulss   %xmm2, %xmm1
        movaps  %xmm5, %xmm3
        movaps  %xmm4, %xmm2
        subss   %xmm1, %xmm3
        addss   %xmm0, %xmm2
        jne     .L3
        movaps  %xmm2, %xmm1
.L2:
        movss   %xmm3, -8(%rsp)
        movss   %xmm1, -4(%rsp)
        movq    -8(%rsp), %xmm0
        ret
.L4:
        movss   .LC1(%rip), %xmm3
        pxor    %xmm1, %xmm1
        jmp     .L2
        .cfi_endproc
.LFE0:
        .size   f, .-f
        .section        .text.unlikely
.LCOLDE2:
        .text
.LHOTE2:
        .section        .rodata.cst4,"aM",@progbits,4
        .align 4
.LC1:
        .long   1065353216
        .ident  "GCC: (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609"
        .section        .note.GNU-stack,"",@progbits

Solution

  • The problem is that the complex type is not SIMD friendly. I have never been a fan of the complex type because it's a composite object that usually does not map to a primitive type or single operation in hardware (certainly not with x86 hardware).

    In order to make complex arithmetic SIMD friendly you need to operate on multiple complex numbers simultaneous. For SSE you need to operate on four complex numbers at once.

    We can use GCC's vector extensions to make the syntax easier.

    typedef float v4sf __attribute__ ((vector_size (16)));
    

    Then we can delcare a union of an array and the vector extension

    typedef union {
      v4sf v;
      float e[4];
    } float4
    

    And lastly we define a block of four complex numbers like this

    typedef struct {
      float4 x;
      float4 y;
    } complex4;
    

    where x is four real parts and y is four imaginary components.

    Once we have this we can multiple 4 complex numbers at once like this

    static complex4 complex4_mul(complex4 a, complex4 b) {
      return (complex4){a.x.v*b.x.v -a.y.v*b.y.v, a.y.v*b.x.v + a.x.v*b.y.v};
    }
    

    and finally we get to your function modified to operate on four complex numbers at a time.

    complex4 f4(complex4 x[], int n) {
      v4sf one = {1,1,1,1};
      complex4 p = {one,one};
      for (int i = 0; i < n; i++) p = complex4_mul(p, x[i]);
      return p;
    }
    

    Let's look at the assembly (Intel syntax) to see if it's optimal

    .L3:
        movaps  xmm4, XMMWORD PTR [rsi]
        add     rsi, 32
        movaps  xmm1, XMMWORD PTR -16[rsi]
        cmp     rdx, rsi
        movaps  xmm2, xmm4
        movaps  xmm5, xmm1
        mulps   xmm1, xmm3
        mulps   xmm2, xmm3
        mulps   xmm5, xmm0
        mulps   xmm0, xmm4
        subps   xmm2, xmm5
        addps   xmm0, xmm1
        movaps  xmm3, xmm2
        jne     .L3
    

    That's exactly four 4-wide multiplications, one 4-wide addition, and one 4-wide subtraction. The variable p stays in register and only the array x is loaded from memory just like we want.

    Let's look at the algebra for the product of complex numbers

    {a, bi}*{c, di} = {(ac - bd),(bc + ad)i}
    

    That's exactly four multiplications, one addition, and one subtraction.

    As I explained in this answer efficient SIMD algebraically is often identical to the scalar arithmetic. So we have replaced four 1-wide multiplications, addition, and subtraction, with four 4-wide multiplications, addition, and subtraction. That's the best you can do with 4-wide SIMD: four for the price of one.

    Note that this does not need any instructions beyond SSE and no additional SSE instructions (except for FMA4) will be any better. So on a 64-bit system you can compile with -O3.

    It is trivial to extend this for 8-wide SIMD with AVX.

    One major advantage of using GCC's vector extensions is you get FMA without any additional effort. E.g. if you compile with -O3 -mfma4 the main loop is

    .L3:
        vmovaps xmm0, XMMWORD PTR 16[rsi]
        add     rsi, 32
        vmulps  xmm1, xmm0, xmm2
        vmulps  xmm0, xmm0, xmm3
        vfmsubps        xmm1, xmm3, XMMWORD PTR -32[rsi], xmm1
        vmovaps xmm3, xmm1
        vfmaddps        xmm2, xmm2, XMMWORD PTR -32[rsi], xmm0
        cmp     rdx, rsi
        jne     .L3