Search code examples
c++data-structuresvalgrind

How to safely implement "Using Uninitialized Memory For Fun And Profit"?


I would like to build a dense integer set in C++ using the trick described at https://research.swtch.com/sparse . This approach achieves good performance by allowing itself to read uninitialized memory.

How can I implement this data structure without triggering undefined behavior, and without running afoul of tools like Valgrand or ASAN?

Edit: It seems like responders are focusing on the word "uninitialized" and interpreting it in the context of the language standard. That was probably a poor word choice on my part - here "uninitialized" means only that its value is not important to the correct functioning of the algorithm. It's obviously possible to implement this data structure safely (LLVM does it in SparseMultiSet). My question is what is the best and most performant way to do so?


Solution

  • I can see four basic approaches you can take. These are applicable not only to C++ but also most other low-level languages like C that make uninitialized access possible but not allowed, and the last is applicable even to higher-level "safe" languages.

    Ignore the standard, implement it in the usual way

    This is the one crazy trick language lawyers hate! Don't freak out yet though - the solutions following this one won't break the rules, so just skip this part if you are of the rules-stickler variety.

    The standard makes most uses of uninitialized values undefined and the few loopholes it does allow (e.g., copying one undefined value to another) don't really give you enough rope to actually implement what you want - even in C which is slightly less restrictive (see for example this answer covering C11, which explains that while accessing an indeterminiate value may not directly trigger UB anything that results is also indeterminate and indeed the value may appear to chance from access to access).

    So you just implement it anyway, with the knowledge that most or all currently compilers will just compile it to the expected code, and know that your code is not standards compliant.

    At least in my test all of gcc, clang and icc didn't take advantage of the illegal access to do anything crazy. Of course, the test is not comprehensive and even if you could construct one, the behavior could change in a new version of the compiler.

    You would be safest if the implementation of the methods that access uninitialized memory was compiled, once, in a separate compilation unit - this makes it easy to check that it does the right thing (just check the assembly once) and makes it nearly impossible (outside of LTGC) for the compiler to do anything tricky, since it can't prove whether uninitialized values are being accessed.

    Still, this approach is theoretically unsafe and you should check the compiled output very carefully and have additional safeguards in place if you take it.

    If you take this approach, tools like valgrind are fairly likely to report a uninitialized read error.

    Now these tools work at the assembly level, and some uninitialized reads may be fine (see, for example, the next item on fast standard library implementations), so they don't actually report a uninitialized read immediately, but rather have a variety of heuristics to determine if invalid values are actually used. For example, they may avoid reporting an error until they determine the uninitialized value is used to determine the direction of a conditional jump, or some other action that is not trackable/recoverable according to the heuristic. You may be able to get the compiler to emit code that reads uninitialized memory but is safe according to this heuristic.

    More likely, you won't be able to do that (since the logic here is fairly subtle as it relies on the relationship between the values in two arrays), so you can use the suppression options in your tools of choice to hide the errors. For example, valgrind can suppress based on the stack trace - and in fact there are already many such suppression entries used by default to hide false-positives in various standard libraries.

    Since it works based on stack traces, you'll probably have difficulties if the reads occur in inlined code, since the top-of-stack will then be different for every call-site. You could avoid this my making sure the function is not inlined.

    Use assembly

    What is ill-defined in the standard, is usually well-defined at the assembly level. It is why the compiler and standard library can often implement things in a faster way than you could achieve with C or C++: a libc routine written in assembly is already targeting a specific architecture and doesn't have to worry about the various caveats in the language specification that are there to make things run fast on a variety of hardware.

    Normally, implementing any serious amount of code in assembly is a costly undertaking, but here it is only a handful, so it may be feasible depending on how many platforms you are targeting. You don't even really need to write the methods yourself - just compile the C++ version (or use godbolt and copy the assembly. The is_member function, for example1, looks like:

    sparse_array::is_member(unsigned long):
            mov     rax, QWORD PTR [rdi+16]
            mov     rdx, QWORD PTR [rax+rsi*8]
            xor     eax, eax
            cmp     rdx, QWORD PTR [rdi]
            jnb     .L1
            mov     rax, QWORD PTR [rdi+8]
            cmp     QWORD PTR [rax+rdx*8], rsi
            sete    al
    

    Rely on calloc magic

    If you use calloc2, you explicitly request zeroed memory from the underlying allocator. Now a correct version of calloc may simply call malloc and then zero out the returned memory, but actual implementations rely on the fact that the OS-level memory allocation routines (sbrk and mmap, pretty much) will generally return you zeroed memory on any OS with protected memory (i.e., all the big ones), to avoid zeroing out the memory again.

    As a practical matter, for large allocations, this is typically satisfied by implementing a call like anonymous mmap by mapping a special zero page of all zeros. When (if ever) the memory is written, does copy-on-write actually allocate a new page. So the allocation of large, zeroed memory regions may be for free since the OS already needs to zero the pages.

    In that case, implementing your sparse set on top of calloc could be just as fast as the nominally uninitialized version, while being safe and standards compliant.

    Calloc Caveats

    You should of course test to ensure that calloc is behaving as expected. The optimized behavior is usually only going to occur when your program initializes a lot of long-lived zeroed memory approximately "up-front". That is, the typical logic for optimized calloc if something like this:

    calloc(N)
      if (can satisfy a request for N bytes from allocated-then-freed memory)
        memset those bytes to zero and return them
      else
        ask the OS for memory, return it directly because it is zeroed
    

    Basically, the malloc infrastructure (which also underlies new and friends) has a (possibly empty) pool of memory that it has already requested from the OS and generally tries to allocated there first. This pool is composed of memory from the last block request from the OS but not handed out (e.g., because the user requested 32 bytes but the allocated asks for chunks from the OS in 1 MB blocks, so there is a lot left over), and also of memory that was handed out to the process but subsequently returned via free or delete or whatever. The memory in that pool has arbitrary values, and if a calloc can be satisfied from that pool, you don't get your magic, since the zero-init has to occur.

    On the other hand if the memory has to be allocated from the OS, you get the magic. So it depends on your use case: if you are frequently creating and destroying sparse_set objects, you will generally just be drawing from the internal malloc pools and will pay the zeroing costs. If you have a long-lived sparse_set objects which take up a lot of memory, they likely were allocated by asking the OS and you got the zeroing nearly for free.

    The good news is that if you don't want to rely on the calloc behavior above (indeed, on your OS or with your allocator it may not even be optimized in that way), you could usually replicate the behavior by mapping in /dev/zero manually for your allocations. On OSes that offer it, this guarantees that you get the "cheap" behavior.

    Use Lazy Initialization

    For a solution that is totally platform agnostic you could simply use yet another array which tracks the initialization state of the array.

    First you choose some granule, at which you will track initialization, and use bitmap where each bit tracks the initialization state of that granule of the sparse array.

    For example, let's say you choose your granule to be 4 elements, and the size of the elements in your array is 4 bytes (e.g., int32_t values): you need 1 bit to track every 4 elements * 4 bytes/element * 8 bits/byte, which is an overhead of less than 1%3 in allocated memory.

    Now you simply check the corresponding bit in this array before accessing sparse. This adds some small cost to accessing the sparse array, but doesn't change the overall complexity, and the check is still quite fast.

    For example, your is_member function now looks like:

    bool sparse_set::is_member(size_t i){
        bool init = is_init[i >> INIT_SHIFT] & (1UL << (i & INIT_MASK));
        return init && sparse[i] < n && dense[sparse[i]] == i;
    }
    

    The generated assembly on x86 (gcc) now starts with:

        mov     rax, QWORD PTR [rdi+24]
        mov     rdx, rsi
        shr     rdx, 8
        mov     rdx, QWORD PTR [rax+rdx*8]
        xor     eax, eax
        bt      rdx, rsi
        jnc     .L2
        ...
    .L2:
        ret
    

    That's all associate with the bitmap check. It's all going to be pretty quick (and often off the critical path since it isn't part of the data flow).

    In general, the cost of this approach depends on the density of your set, and what functions you are calling - is_member is about the worse case for this approach since some functions (e.g., clear) aren't affected at all, and others (e.g., iterate) can batch up the is_init check and only do it once every INIT_COVERAGE elements (meaning the overhead would be again ~1% for the example values).

    Sometimes this approach will be faster than the approach suggested in the OP's link, especially when the handling elements not in the set - in this case, the is_init check will fail and often shortcut the remaining code, and in this case you have a working set that is much smaller (256 times using the example granule size) than the size of the sparse array, so you may great reduce misses to DRAM or outer cache levels.

    The granule size itself is an important tunable for this approach. Intuitively, a larger granule size pays a larger initialization cost when the an element covered by the granule is accessed for the first time, but saves on memory and up-front is_init initialization cost. You can come up with a formula that finds the optimum size in the simple case - but the behavior also depends on the "clustering" of values and other factors. Finally, it is totally reasonable to use a dynamic granule size to cover your bases under varying workloads - but it comes at the cost of variable shifts.

    Really Lazy Solution

    It is worth noting that there is a similarity between the calloc and lazy init solutions: both lazily initialize blocks of memory as they are needed, but the calloc solution track this implicitly in hardware through MMU magic with page tables and TLB entries, while the lazy init solution does it in software, with a bitmap explicitly tracking which granules have been allocated.

    The hardware approach has the advantage of being nearly free in (for the "hit" case, anyway) since it uses the always-present virtual memory support in the CPU to detect misses, but the software case has the advantage of being portable and allowing precise control over the granule size etc.

    You can actually combine these approaches, to make a lazy approach that doesn't use a bitmap, and doesn't even need the dense array at all: just allocate your sparse array with mmap as PROT_NONE, so you fault whenever you read from an un-allocated page in a sparse array. You catch the fault and allocate the page in the sparse array with a sentinel value indicating "not present" for every element.

    This is the fastest of all for the "hot" case: you don't need any of the ... && dense[sparse[i]] == i checks any more.

    The downsides are:

    • Your code is really not portable since you need to implement the fault-handling logic, which is usually platform specific.
    • You can't control the granule size: it must be at page granularity (or some multiple thereof). If your set is very sparse (say less than 1 out of 4096 elements occupied) and uniformly distributed, you end up paying a high initialization cost since you need to handle a fault and initialize a full page of values for every element.
    • Misses (i.e., non-insert accesses to set elements that don't exist) either need to allocate a page even if no elements will exist in that range, or will be very slow (incurring a fault) every time.

    1 This implementation has no "range checking" - i.e., it doesn't check if i is greater than MAX_ELEM - depending on your use case you may want to check this. My implementation used a template parameter for MAX_ELEM, which may result in slightly faster code, but also more bloat, and you'd do fine to just make the max size a class member as well.

    2 Really, the only requirement that you use something that calls calloc under the covers or performs the equivalent zero-fill optimization, but based on my tests more idiomatic C++ approaches like new int[size]() just do the allocation followed by a memset. gcc does optimize malloc followed by memset into calloc, but that's not useful if you are trying to avoid the use of C routines anyway!

    3 Precisely, you need 1 extra bit to track every 128 bits of the sparse array.