Why is my %xmm3 register using the first argument in vbroadcastsd and not the fourth?


I am trying to implement a function in assembly that does some basic calculations using SIMD vector instructions and registers. The function signature is void map_poly_double_vec(double* input, double* output, uint64_t length, double a, double b, double c, double d);

For some reason, when I use vbroadcastsd %xmm3, %ymm6 to copy the double d argument into every element of %ymm6, the register ends up holding the double a argument instead. The other vbroadcastsd instructions work fine; only the last one misbehaves. I stepped through it in GDB to figure out why, but the instruction simply runs and picks up the first double argument rather than the fourth.

Here is my assembly function (AT&T syntax):

map_poly_double_vec:
    mov $0, %rcx

    vbroadcastsd %xmm0, %ymm3 #a
    vbroadcastsd %xmm1, %ymm4 #b
    vbroadcastsd %xmm2, %ymm5 #c
    vbroadcastsd %xmm3, %ymm6 #d

mpdv_loop:
    cmp %rdx, %rcx
    je mpdv_end

    vmovupd (%rdi, %rcx, 8), %ymm0
    vmovupd %ymm0, %ymm1

    vfmadd132pd %ymm3, %ymm4, %ymm1
    vfmadd132pd %ymm0, %ymm5, %ymm1
    vfmadd132pd %ymm1, %ymm6, %ymm0

    vmovupd %ymm0, (%rsi, %rcx, 8)

    add $4, %rcx
    jmp mpdv_loop

mpdv_end:
    ret

This is what I am using to test the function.

#include <assert.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include "lab11.h"

double* create_array(uint64_t length) {
    double* array = (double*)malloc(length * sizeof(double));
    if (array == NULL) {
        return NULL;
    }
    for (uint64_t i = 0; i < length; i++) {
        array[i] = ((double)rand() / RAND_MAX - 0.25);
    }
    return array;
}

void print_double_array(double* array, uint64_t length) {
    printf("{ ");
    for (uint64_t i = 0; i < length; i++) {
        printf("%.6g ", array[i]);
    }
    printf("}\n");
}

int main(void) {
    uint64_t length = 16;
    double* doubles1 = create_array(length);
    double* double_out = (double*)malloc(length * sizeof(double));

    printf("map_poly_double_vec result:\n");
    memset(double_out, 0, length * sizeof(double));
    map_poly_double_vec(doubles1, double_out, length, 4, 5, 6, 7);
    print_double_array(double_out, length);

    free(doubles1);
    free(double_out);
    return 0;
}

What I should get:

{ 13.105 7.98257 12.2256 12.4544 14.3174 6.69849 7.55013 12.0089 7.17059 9.39815 8.66996 10.2085 7.76063 9.0004 15.0642 14.3989 }

What I get with my function:

map_poly_double_vec result:
{ 10.105 4.98257 9.22559 9.45443 11.3174 3.69849 4.55013 9.00889 4.17059 6.39815 5.66996 7.20848 4.76063 6.0004 12.0642 11.3989 }

Solution

  • See here in the first broadcast:

    vbroadcastsd %xmm0, %ymm3 #a   <----
    vbroadcastsd %xmm1, %ymm4 #b
    vbroadcastsd %xmm2, %ymm5 #c
    vbroadcastsd %xmm3, %ymm6 #d
    

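    After this instruction, ymm3, whose low element used to hold argument d, has been overwritten with argument a broadcast to every element.

    So vbroadcastsd %xmm3, %ymm6 picks up argument a again. Doing the d broadcast first, before anything writes to ymm3, or broadcasting a into a register other than ymm3, avoids the clobber.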

    In case this was the source of confusion: AVX extends the xmm registers to ymm registers; the ymm registers are not an independent new set of registers. Every 256-bit ymm register can be seen as two chunks of 128 bits, and the bottom 128 bits are the corresponding xmm register, e.g. xmm3 is the bottom 128 bits of ymm3.
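
As a sanity check on this diagnosis, the two outputs in the question differ by exactly 3 in every element (13.105 vs 10.105, 7.98257 vs 4.98257, ...), which is d - a = 7 - 4. That is what you would expect if a were added in place of d. Below is a minimal scalar sketch in C of what the loop computes, following the same Horner order as the three vfmadd132pd instructions. The names poly_horner and map_poly_double_ref are hypothetical helpers for illustration, not part of the original lab code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: evaluates a*x^3 + b*x^2 + c*x + d in the same
   order as the three vfmadd132pd instructions in the question. */
double poly_horner(double x, double a, double b, double c, double d) {
    double t = a * x + b;   /* first FMA:  x*a + b         */
    t = t * x + c;          /* second FMA: (a*x + b)*x + c */
    return x * t + d;       /* third FMA:  x*t + d         */
}

/* Hypothetical scalar reference taking the same arguments, in the same
   order, as map_poly_double_vec. */
void map_poly_double_ref(const double* input, double* output, uint64_t length,
                         double a, double b, double c, double d) {
    assert(length == 0 || (input != NULL && output != NULL));
    for (uint64_t i = 0; i < length; i++) {
        output[i] = poly_horner(input[i], a, b, c, d);
    }
}
```

With the test call's coefficients, poly_horner(0.5, 4, 5, 6, 7) gives 11.75, while passing a in the last slot, poly_horner(0.5, 4, 5, 6, 4), gives 8.75. The gap of 3 matches the gap between the expected and actual output in the question.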