Search code examples
c++concurrencyatomicmemory-barrierscarries-dependency

Understanding std::kill_dependency


Consider an example from the book "Concurrency in Action" by Antony Williams where we have a global read-only array and use std::memory_order_consume when retrieving an index into that array from another thread:

int global_data[]={ ... };
std::atomic<int> index;
void f()
{
    int i=index.load(std::memory_order_consume);
    do_something_with(global_data[std::kill_dependency(i)]);
}

Does std::kill_dependency() suggest the compiler that it can reorder access to i? So it can do something like this:

predicted_i = ...; // unordered/predicted read of i
i = index.load(std::memory_order_consume); // ordered load of i
global_data = ...; // ordered read of global_data array memory
elem = global_data[predicted_i];

Or does std::kill_dependency() let the compiler know that it doesn’t need to reread the contents of the array or array element? So it can do something like this:

predicted_global_data = ...; // unordered read of global_data array memory
i = index.load(std::memory_order_consume); // ordered load of i
elem = predicted_global_data[i]; 

Now consider a more complex example from cppreference:

struct Foo
{
    int* a;
    int* b;
};
 
std::atomic<Foo*> foo_head[10];
int foo_array[10][10];
 
// consume operation starts a dependency chain, which escapes this function
[[carries_dependency]] Foo* f(int i)
{
    return foo_head[i].load(memory_order_consume);
}
 
// the dependency chain enters this function through the right parameter and is
// killed before the function ends (so no extra acquire operation takes place)
int g(int* x, int* y [[carries_dependency]])
{
    return std::kill_dependency(foo_array[*x][*y]);
}


int c = 3;
void h(int i)
{
    Foo* p;
    p = f(i); // dependency chain started inside f continues into p without undue acquire
    do_something_with(g(&c, p->a)); // p->b is not brought in from the cache
    do_something_with(g(p->a, &c)); // left argument does not have the carries_dependency
                                    // attribute: memory acquire fence may be issued
                                    // p->b becomes visible before g() is entered
}

Comments for g(p->a, &c) state that memory acquire fence may still be issued despite despite the fact that there is std::kill_dependency in the first call to g(). Could somebody explain why? I would expect that after std::kill_dependency in the first call to g() the compiler is free to reorder accesses to data dependent on p.


Solution

  • Background

    The idea with memory_order_consume is that most CPUs will respect address and data dependencies (but not control dependencies) for purposes of memory ordering. So taking ARM64 as an example, if we have

    ldr x0, [index]
    ldr x1, [global_data, x0]
    

    then the load from global_data will not be reordered with the load of index, even though the load from index doesn't have an acquire barrier (ldar). So if index was stored with a release store (stlr) in another thread, the load from global_data will observe all memory effects that preceded the store. And this promise carries over through arbitrarily long chains of dependency through registers and memory. So for instance

    ldr x0, [index]
    add x2, x0, x4
    shl x5, x2, #3
    ldr x1, [x5]
    

    also ensures that the second load would see all effects that preceded a release store of index, because there's a chain of instructions connecting the two, each of which uses as one of its inputs an output of the previous one. Note this applies even if the output doesn't mathematically depend on the input; sub x5, x2, x2 would still carry a dependency from x2 to x5, even though the result in x5 is always 0.

    On the other hand, given a control dependency like

    ldr x0, [index]
    cbnz x0, elsewhere
    ldr x1, [data]
    

    the machine is allowed to predict the branch as not taken, and perform the load of data before the load of index, discarding it if the branch turns out to have been taken after all. The load of data is not guaranteed to see memory effects that preceded the release store of index.

    At the level of C++, when the compiler sees a memory_order_consume load, it is supposed to ensure that wherever the result of that load is used syntactically, the generated code actually preserves a dependency through registers and memory. And so for instance if your code were to contain

    i = index.load(std::memory_order_consume);
    t = data[i-i];
    

    the compiler would be obliged to emit something like

    ldr x0, [index]
    sub x2, x0, x0
    ldr x1, [data, x2]
    

    But the memory_order_consume demands that it not optimize this into

    ldr x0, [index]
    ldr x1, [data]
    

    which although it looks mathematically equivalent, does not preserve the dependency. The consume ordering also prohibits the compiler from transforming this address dependency into a control dependency, which would have the same issue.

    As you can see, this makes it rather complicated for the compiler to keep track of what optimizations are or are not allowed, and can require it to refrain from very basic optimizations which it might otherwise do as a matter of course. This is part of why AFAIK no mainstream compiler has actually implemented memory_order_consume, and simply treats it as memory_order_acquire instead. So all of this discussion from here on out is theoretical.


    Q1

    The purpose of kill_dependency is to release the compiler from these prohibitions, in settings where the ordering is not really needed. So given

    int i=index.load(std::memory_order_consume);
    do_something_with(global_data[std::kill_dependency(i)]);
    

    then yes, the compiler would be allowed to predict the value of index, and perform the load from global_data before the load from index.

    But it can only use the result of that load if its guess about the value of index turned out to match the value actually loaded. The transformations you propose don't make sense because they don't contain any such check. (Though maybe you meant that to be implied?) But the following dependency-breaking optimizations would be legitimate:

    // version A
    predicted_i = ...;
    predicted_data = global_data[predicted_i];
    i = index.load();
    if (i == predicted_i)
        do_something_with(predicted_data);
    

    Or:

    // version B
    memcpy(cached_data, global_data, size);
    i = index.load();
    do_something_with(cached_data[i]);
    

    Or:

    // version C
    predicted_i = ...;
    i = index.load(memory_order_relaxed);
    if (i == predicted_i)
        do_something_with(global_data[predicted_i]);
    

    In version C, although the load from global_data appears to come after the load of i, it's only guarded by a control dependency, so an ARM64 CPU would be free to reorder them.

    To convince yourself that all these transformations are allowed, note that they would only change the program's observable behavior if there was concurrent writing to global_data. The presence of kill_dependency has destroyed any hope we might have had of proving that such modification happens-before our read of global_data; it no longer carries a dependency from the read (intro.races p7.1.1). So if there is concurrent writing, then the program has a data race and its behavior is undefined anyway.

    In effect, the kill_dependency negates all the guarantees that were originally provided by memory_order_consume. Your example is thus rather pointless, since it ends up equivalent to just loading index with memory_order_relaxed. Of course, the intended use case of kill_dependency was if there were other uses of i whose dependencies should be preserved.

    Q2

    The kill_dependency in the return statement of g only kills that chain of the dependency. So evaluations that make use of the return value of g (i.e. inside do_something_with()) do not have to preserve the dependency on the load from foo_head[i], and can be reordered with it unless some other barrier or dependency prevents it.

    But the use of p in the arguments of the second call to g is derived directly from the load of foo_head[i], not via the return value of the first call to g, and so the earlier kill_dependency is irrelevant to it. It's a separate chain of dependency which does not go through a kill_dependency, so it must be preserved.

    To understand why an acquire barrier is therefore needed in the second call to g, keep in mind that the requirement to preserve dependency on the load of foo_head[i] applies to all evaluations which depend on it syntactically, and that includes evaluations which are inside other functions. Because of the [[carries_dependency]] on Foo (see below), the compiler has been put on notice when compiling h that dependencies derived from Foo()'s return value are properly ordered.

    So when compiling the call g(p->a, &c) within h(), the compiler is still on the hook to ensure that any evaluations inside g() that depend syntactically on g's first parameter remain properly ordered with the load of foo_head[i]. But unless g is being inlined, the compiler generating code for h has no way to ensure that g doesn't perform some of the dependency-breaking optimizations described above, which could then allow the forbidden reordering to occur. After all, g itself had no idea that its first argument might have come from a memory_order_consume load, so there was no reason to avoid such optimizations when compiling g. This would be even more clear if g was defined in a different translation unit - then there is no possible way for h to know what g might do, yet h is still responsible to maintain ordering.

    Thus, the only way for h to ensure that ordering is preserved is to emit an acquire barrier before the call to g, thus guaranteeing that none of the evaluations within g can be reordered with the earlier load of foo_head[i].

    The [[carries_dependency]] attribute is the way around this. When a function argument is declared with [[carries_dependency]], as with the second argument of g, it's a contract that g will ensure that any syntactic uses of its second argument are executed so as to preserve the dependency; it will not perform any of the above dependency-breaking optimizations on them. So in generating the first call to g, the compiler doesn't need an acquire fence, since g itself is promising to keep things in order with respect to its second argument.

    In the second call, we pass p->a through the first argument, which doesn't have [[carries_dependency]]; so g has made no such promise, and the only way to be safe is to emit an acquire fence.

    There's a similar situation in Foo. Again, since Foo made a memory_order_consume load, it's on the hook to ensure that every evaluation derived from that load preserves the ordering. Without [[carries_dependency]] on the return value, it would have no guarantee that its caller would respect that, so the only way to be safe is to emit a barrier before returning. But with [[carries_dependency]], the caller is assigned the responsibility to preserve dependencies in evaluations derived from the return value of Foo, just as it would with evaluations derived directly from a memory_order_consume load within the caller itself. Then Foo is off the hook and need not emit a barrier.