Tags: c++, segmentation-fault, undefined-behavior, simd, intrinsics

SIMD load across memory boundary doesn't cause segfault?


Suppose I do an (unaligned) packed load _mm256_loadu_pd on a double (see the code snippet below). Does this violate the strict aliasing rule or otherwise result in undefined behavior as per the C++ standard? Shouldn't it trigger a segmentation fault in theory?

If not, can this behavior be relied upon? (This is useful when, say, loading 3 doubles in one go.)


The following code compiles without warnings (gcc-14.2.1 g++ -mavx -pedantic -Wall) and runs fine (on GNU/Linux 6.13.2-arch1-1):

#include <immintrin.h>
#include <cstdio>

int main() {
    double* ptr = new double {};  // allocation is only 8 bytes
    double buf[4];
    // 32-byte unaligned load reads 24 bytes past the end of the allocation
    _mm256_storeu_pd(buf, _mm256_loadu_pd(ptr));
    delete ptr;
    std::printf("%e\n", buf[3]);
}

No segfaults whatsoever. ASan (-fsanitize=address) does report a heap-buffer-overflow, though.

Edit: Following Wenzel's link, I think this is not UB (because it's not the C++ standard that defines the SIMD types), but the behavior may be implementation dependent. My question is more: why is this load allowed at all (by the kernel? by the hardware?)? Doesn't this give me access to memory that I do not own?


Solution

  • As @gnasher729 says, UB doesn't mean a fault is required. Just the opposite: the compiler can assume UB doesn't happen, and if it does, whatever happens at any point earlier or later in the program (including continuing to run normally) is allowed by the ISO C++ standard.

    Hardware memory protection happens with page granularity. As far as the kernel is concerned, your process owns the whole 4K or 2M page containing the double. The finer-grained bookkeeping for new/delete is only done in user-space.

    A load that includes any valid bytes will only fault if it crosses into the next page and that next page is unmapped. This can't happen for an aligned load that accesses at least one valid byte, since the page size is (much) larger than the vector width. This enables SIMD vectorization even for algorithms like strlen, where the last valid byte isn't known until we actually read it. (Do some address math to reach an aligned starting point, then use aligned loads in the loop; see the sketch below.)
    See Is it safe to read past the end of a buffer within the same page on x86 and x64?
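
    Here's a minimal sketch of that strlen-style idiom (my illustration, not code from the question; the name vec_strlen is made up), using SSE2 so it doesn't even need -mavx. The aligned loads may read bytes past the terminator but never cross into another page. (Reading past the end of the string object is still UB in ISO C++, as discussed above; this just shows the asm-level technique.)

    #include <immintrin.h>
    #include <cstddef>
    #include <cstdint>

    std::size_t vec_strlen(const char* s) {
        const char* p = s;
        // Scalar handling until p is 16-byte aligned, so every load in the
        // main loop stays within the page that holds its first byte.
        while (reinterpret_cast<std::uintptr_t>(p) % 16 != 0) {
            if (*p == '\0') return static_cast<std::size_t>(p - s);
            ++p;
        }
        const __m128i zero = _mm_setzero_si128();
        for (;;) {
            // Aligned 16-byte load: may read past the terminator, but can't
            // fault, because it can't cross into the next page.
            __m128i v = _mm_load_si128(reinterpret_cast<const __m128i*>(p));
            int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(v, zero));
            if (mask != 0)
                return static_cast<std::size_t>(p - s) + __builtin_ctz(mask);
            p += 16;
        }
    }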

    new could return a pointer to a double in the last 8 or 16 bytes of a page, but happens not to for the first allocation in a fresh program. (The memory will be aligned by at least alignof(double), which is only 4 on i386 Linux. In practice libstdc++ / glibc will return memory sufficiently aligned for max_align_t, which is 16 on i386 and x86-64 GNU/Linux. But only 8 on 32-bit x86 Windows.)
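
    For illustration, a hypothetical helper (the name and the 4 KiB page size are my assumptions) that computes how many bytes remain between a pointer and the end of its page, i.e. whether a 32-byte load starting there could even touch the next page:

    #include <cstddef>
    #include <cstdint>

    // Bytes from p to the end of the page containing it (assuming 4 KiB pages).
    inline std::size_t bytes_to_page_end(const void* p, std::size_t page_size = 4096) {
        auto addr = reinterpret_cast<std::uintptr_t>(p);
        return page_size - (addr & (page_size - 1));
    }
    // _mm256_loadu_pd(ptr) stays within ptr's page iff bytes_to_page_end(ptr) >= 32.
    // (That only predicts whether the hardware could fault; it doesn't make the
    // out-of-bounds read well-defined C++.)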

    (There is some research into memory-safe ISAs: Why can't we have a safe ISA? - for example, the CHERI project based on RISC-V. Until/unless we're compiling for and running on something like that, we can't expect hardware to trap accesses outside object bounds within a page. On current hardware, we can only get that with extra software checking that comes at great performance cost, like valgrind or -fsanitize=address.)


    Your store is to a 32-byte array (double buf[4];) so you're not corrupting anything with it. Storing past the end of an object is very bad even if you don't fault: you could corrupt the allocator's bookkeeping data, or the payload of another allocation.

    Even load / blend / store (putting back the same bytes you loaded from outside the bytes you own) isn't thread-safe: you could revert a modification made by another thread. For this reason, compilers must not invent stores to objects the abstract machine hasn't at least read; for objects you have read, it would be data-race UB if another thread were writing them, but compilers often don't invent stores even then. So in cases where no other thread is accessing this region of the array and a load/blend/store is safe, write arr[i] = cond ? x : arr[i]; instead of if (cond) arr[i] = x; to allow auto-vectorization with a SIMD blend; see the sketch below. (Fun fact: SVE, AVX-512, and AVX2 for float/double have masked stores that allow vectorization of conditional stores, but the AVX2 ones are not fast on AMD even with Zen 4, even though Zen 4 handles AVX-512 masked stores efficiently. https://uops.info/ - vmaskmovpd stores and vmovapd mem{k}, ymm)
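
    As a sketch of that idiom (my example; the function names are hypothetical), the ternary form stores to arr[i] on every iteration, so the compiler can vectorize it as load / compare / blend / store, while the if form only stores conditionally:

    #include <cstddef>

    // Unconditional store: auto-vectorizable with a SIMD blend, but only safe
    // when no other thread writes this region of arr concurrently.
    void assign_blend(double* arr, std::size_t n, double thresh, double x) {
        for (std::size_t i = 0; i < n; ++i)
            arr[i] = (arr[i] > thresh) ? x : arr[i];
    }

    // Conditional store: the compiler can't invent a store to arr[i] here, so
    // without masked stores this typically doesn't vectorize the store.
    void assign_branch(double* arr, std::size_t n, double thresh, double x) {
        for (std::size_t i = 0; i < n; ++i)
            if (arr[i] > thresh) arr[i] = x;
    }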


    I think this is not UB (because it's not the C++ standard that defines the SIMD types), but the behavior may be implementation dependent.

    That reasoning is faulty. Language extensions can define some but not all ways of using them. You are reading past the end of a new double{} allocation, which is UB.

    It's even visible at compile time. But in practice, compilers currently just generate asm instructions that load from the address you asked them to load from, especially when you don't enable optimization.

    The alternative would be for the compiler to assume this path of execution is unreachable, since compilers can in general assume that programs don't encounter undefined behaviour. If they actually do, the standard doesn't require any specific behaviour for any part of your program before or after; literally anything is allowed to happen, including continuing to run like nothing happened. UB is the opposite of "exception required" / "must trap"; it allows optimizers to make assumptions. Sanitizers like -fsanitize=undefined or -fsanitize=address do change that, though, making some kinds of UB an error that gets reported.


    following Wenzel's link,

    It's not a duplicate of "Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?": here there's no raw deref of a __m256d*, only the opaque loadu and storeu functions, which do unaligned, aliasing-safe loads and stores. They're equivalent to memcpy in terms of correctness and which bytes are accessed.
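
    A minimal sketch of that equivalence (my illustration, assuming AVX is enabled): both functions access exactly the 32 bytes starting at p, with no pointer-type-based aliasing assumptions:

    #include <immintrin.h>
    #include <cstring>

    __m256d load_intrinsic(const double* p) {
        return _mm256_loadu_pd(p);        // unaligned, aliasing-safe load
    }

    __m256d load_memcpy(const double* p) {
        __m256d v;
        std::memcpy(&v, p, sizeof(v));    // same 32 bytes, same correctness
        return v;
    }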