Tags: c, assembly, x86, vectorization, auto-vectorization

Understanding vectorization with SSE instructions


I am trying to understand how vectorization with SSE instructions works.

Here is a code snippet in which vectorization is achieved:

#include <stdlib.h>
#include <stdio.h>

#define SIZE 10000

void test1(double * restrict a, double * restrict b)
{
  int i;

  double *x = __builtin_assume_aligned(a, 16);
  double *y = __builtin_assume_aligned(b, 16);

  for (i = 0; i < SIZE; i++)
  {
    x[i] += y[i];
  }
}

and here is my compilation command:

gcc -std=c99 -c example1.c -O3 -S -o example1.s

Here is the resulting assembly output:

  .file "example1.c"
  .text
  .p2align 4,,15
  .globl  test1
  .type test1, @function
test1:
.LFB7:
  .cfi_startproc
  xorl  %eax, %eax
  .p2align 4,,10
  .p2align 3
.L3:
  movapd  (%rdi,%rax), %xmm0
  addpd (%rsi,%rax), %xmm0
  movapd  %xmm0, (%rdi,%rax)
  addq  $16, %rax
  cmpq  $80000, %rax
  jne .L3
  rep ret
  .cfi_endproc
.LFE7:
  .size test1, .-test1
  .ident  "GCC: (Debian 4.8.2-16) 4.8.2"
  .section  .note.GNU-stack,"",@progbits

I practiced assembly many years ago, and I would like to know what the registers %rdi, %rax and %rsi represent in the code above.

I know %xmm0 is a SIMD register that can hold 2 doubles (16 bytes).

But I don't understand how the simultaneous addition is performed.

I think it all happens here:

      movapd  (%rdi,%rax), %xmm0
      addpd (%rsi,%rax), %xmm0
      movapd  %xmm0, (%rdi,%rax)
      addq  $16, %rax
      cmpq  $80000, %rax
      jne .L3
      rep ret

Does %rax represent the "x" array?

What does %rsi represent in the C code snippet?

Is the final result (for example, a[0] = a[0] + b[0]) stored in %rdi?

Thanks for your help.


Solution

  • The first thing you need to know is the calling convention for 64-bit code on Unix systems (the System V AMD64 ABI). See Wikipedia's article on x86-64 calling conventions, and for much more detail read Agner Fog's calling conventions manual.

    Integer and pointer parameters are passed in registers in the following order: rdi, rsi, rdx, rcx, r8, r9. So you can pass up to six integer values in registers (but only four on Windows). In your case this means that:

    rdi = &x[0],
    rsi = &y[0].
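
As a small illustration of that register order, here is a hypothetical six-argument function annotated with where each argument arrives under the System V AMD64 convention. The name sum6 is made up for this sketch, and the mapping shown applies to integer and pointer arguments of an out-of-line (non-inlined) call only.

```c
/* Hypothetical example: the comments show the register that carries each
   integer argument under the System V AMD64 calling convention. */
long sum6(long a,   /* rdi */
          long b,   /* rsi */
          long c,   /* rdx */
          long d,   /* rcx */
          long e,   /* r8  */
          long f)   /* r9  */
{
    return a + b + c + d + e + f;   /* integer result is returned in rax */
}
```

By the same rule, the two pointer arguments of test1 arrive in rdi and rsi.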
    

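    The register rax is not the array itself: it is a byte offset that starts at zero and is incremented by 2*sizeof(double) = 16 bytes each iteration. The operand (%rdi,%rax) therefore addresses two consecutive doubles of x, and (%rsi,%rax) the matching two doubles of y. The first movapd loads the pair from x into xmm0, addpd adds the pair from y to both halves at once, and the second movapd writes the result back to memory in x (not into rdi, which only holds the address). rax is then compared with sizeof(double)*10000 = 80000 each iteration to test whether the loop is finished.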
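
As a rough C-level picture of what the three SSE instructions do, the loop can be written by hand with SSE2 intrinsics. This is only a sketch: the function name test1_sse2 and the explicit byte-offset variable off (standing in for rax) are illustrative assumptions, not something GCC produces in this form.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics: __m128d, _mm_load_pd, ... */
#include <stddef.h>
#include <stdint.h>

#define SIZE 10000

/* Hypothetical hand-written counterpart of the vectorized loop.
   x plays the role of rdi, y of rsi, and off of rax. */
void test1_sse2(double *x, const double *y)
{
    /* Aligned loads and stores need 16-byte aligned addresses, which the
       original code promises via __builtin_assume_aligned. */
    assert((uintptr_t)x % 16 == 0 && (uintptr_t)y % 16 == 0);

    for (size_t off = 0; off != SIZE * sizeof(double); off += 16) {
        double *px = (double *)((char *)x + off);                    /* rdi + rax */
        const double *py = (const double *)((const char *)y + off);  /* rsi + rax */

        __m128d t = _mm_load_pd(px);          /* movapd (%rdi,%rax), %xmm0 */
        t = _mm_add_pd(t, _mm_load_pd(py));   /* addpd  (%rsi,%rax), %xmm0 */
        _mm_store_pd(px, t);                  /* movapd %xmm0, (%rdi,%rax) */
    }
}
```

Each _mm_add_pd adds two doubles at once, so the loop body runs 5000 times for the 10000 elements.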

    The use of cmp here is actually an inefficiency in GCC. Modern Intel processors can fuse cmp and jne into a single micro-operation, and they can also fuse add and jne, but they cannot fuse add, cmp, and jne all together. It is possible, however, to remove the cmp instruction.

    What GCC should have done is set

    rdi = (char *)&x[0] + 80000;
    rsi = (char *)&y[0] + 80000;
    rax = -80000;
    

    and then the loop could be written like this:

    .L3:
    movapd  (%rdi,%rax), %xmm0       # temp = x[i]
    addpd (%rsi,%rax), %xmm0         # temp += y[i]
    movapd  %xmm0, (%rdi,%rax)       # x[i] = temp
    addq  $16, %rax                  # i += 2 (sets ZF when rax reaches 0)
    jnz .L3                          # loop while rax != 0
    

    Now the loop counts from -80000 up to 0, so it does not need the cmp instruction, and the add and jnz will be fused into one micro-operation.
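
The same idea can be expressed at the C level by pointing at the end of each array and counting a negative index up to zero. This is only a sketch with a hypothetical function name (test1_countup). Whether a given compiler turns it into the add/jnz loop above is not guaranteed.

```c
#include <stddef.h>

#define SIZE 10000

/* Hypothetical variant: the index runs from -SIZE up to 0, so the loop
   exit test is simply "did the increment produce zero?". */
void test1_countup(double * restrict a, const double * restrict b)
{
    double *x_end = a + SIZE;          /* one past the last element of a */
    const double *y_end = b + SIZE;    /* one past the last element of b */

    for (ptrdiff_t i = -SIZE; i != 0; i++) {
        x_end[i] += y_end[i];          /* same element as a[i + SIZE] */
    }
}
```

Indexing backwards from the end pointers touches exactly the same elements as the original loop, in the same order.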