Problem: vmovaps is giving me a segmentation fault.
Context: The x86-64 instruction vmovaps is designed to be used with the AVX registers, which are available on the Core i-series processor this system runs on. The AVX registers are twice as wide as the SSE ones (256 vs. 128 bits respectively). vmovaps should move a vector of aligned 32-bit floating-point values into the specified ymm register.
Likely Cause: The alignment of the source data is of particular importance; incorrectly aligned data is a common source of segmentation faults. However, I am encountering a segmentation fault even though I have aligned my data.
segment .data
align 16
xs:
dd 0.0
dd 1.1
dd 2.2
dd 3.3
dd 4.4
dd 5.5
dd 6.6
dd 7.7
align 16
ys:
dd 8.8
dd 7.7
dd 6.6
dd 5.5
dd 4.4
dd 3.3
dd 2.2
dd 1.1
segment .text
global main
main:
push rbp
mov rbp, rsp
; Move eight 32-bit floats from "xs" into ymm0
vmovaps ymm0, [xs]
; Move eight 32-bit floats from "ys" into ymm1
vmovaps ymm1, [ys]
; Add the eight pairs of floats simultaneously, put the result in ymm0
vaddps ymm0, ymm1
xor rax, rax
leave
ret
Compiled with: yasm -f elf64 -g dwarf2 <filename>
Linked with: gcc -o <bin-name> <filename>.o
When I run this under GDB, it simply reports that the program received a segmentation fault signal on the first vmovaps instruction. I have checked the documentation on alignment and I think it is all correct. For what it's worth, I am running this on an i5-8600K.
I've also looked at this similar question, but I can't really apply the answer to his problem to mine (it was something to do with his inline assembly). If anyone could weigh in on this I'd be grateful!
vmovaps with a ymm operand requires 32-byte alignment. To quote the manual:
When the source or destination operand is a memory operand, the operand must be aligned on a 16-byte (128-bit version), 32-byte (VEX.256 encoded version) or 64-byte (EVEX.512 encoded version) boundary or a general-protection exception (#GP) will be generated. For EVEX.512 encoded versions, the operand must be aligned to the size of the memory operand.
(emphasis added). Linux delivers SIGSEGV to processes that raise a #GP exception.
Thus, you should change align 16 to align 32 for your static arrays of dd elements.
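A sketch of that fix applied to your data section, keeping your labels (the dd values are merged onto single lines purely for brevity):

```nasm
segment .data
align 32          ; vmovaps with a ymm register needs 32-byte alignment
xs:
    dd 0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7
align 32          ; xs is exactly 32 bytes here, but re-aligning is cheap insurance
ys:
    dd 8.8, 7.7, 6.6, 5.5, 4.4, 3.3, 2.2, 1.1
```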
Or use vmovups to do unaligned loads and let the hardware handle it; it's the same speed on data that happens to be aligned, and on most CPUs also for loads/stores that don't split across a cache-line boundary.
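A sketch of that alternative in your main, using your existing labels:

```nasm
; vmovups tolerates any alignment, so align 16 in .data is enough
vmovups ymm0, [xs]
vmovups ymm1, [ys]
vaddps  ymm0, ymm0, ymm1
```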
Related: How to solve the 32-byte-alignment issue for AVX load/store operations? for C and C++ ways of aligning things, including arrays in automatic (stack) or dynamic storage.