Am I understanding memory ordering models correctly?

I am learning about C memory ordering models, and I came up with this little code for a producer and consumer sharing a "bucket" with "units".

I intended to create this sequence of "happens before" relationships between producer (P) and consumer (C):

P: write bucket [fill]
P: release store bucket_full
C: acquire load bucket_full
C: write,read bucket [decrement,print]
C: release store bucket_full
P: acquire load bucket_full
P: write bucket [fill] (the cycle repeats)

(I understand that there are cleaner ways to achieve a producer-consumer duo, this is just an exercise)

Question(s):

Am I establishing the "happens before" relationships correctly?
Am I actually causing a data race that, by sheer luck, just isn't visible?

Code:

#include <stdio.h>
#include <stdatomic.h>
#include <pthread.h>
#include <sched.h> // declares sched_yield()

int num_buckets = 30, bucket = 0, units_per_bucket = 12;
_Atomic int bucket_full = 0;

void *producer(void *arg){
    for(int i=0; i<num_buckets+1; ++i){
        //acquire load to wait for "empty bucket" signal
        while(atomic_load_explicit(&bucket_full, memory_order_acquire))
            sched_yield();
        if(i == num_buckets){ // no more buckets to fill
            atomic_store_explicit(&bucket_full, -1, memory_order_release);
            break;
        }
        bucket = units_per_bucket; // fill bucket
        printf("\n[%2i]: %d", i+1, bucket);
        // release store to signal "bucket is full"
        atomic_store_explicit(&bucket_full, 1, memory_order_release);
    }
    return arg;
}

void *consumer(void *arg){
    while(1){
        //acquire load to wait for "bucket is full" signal
        while(!atomic_load_explicit(&bucket_full, memory_order_acquire))
            sched_yield();
        if(bucket_full == -1) //no more buckets, exit
            break;
        bucket --;
        printf(" %d", bucket);
        if(bucket < 1) // release store to signal "bucket is empty"
            atomic_store_explicit(&bucket_full, 0, memory_order_release);
    }
    return arg;
}

int main(){
    pthread_t t0, t1;
    pthread_create(&t0, NULL, consumer, (void *)NULL);
    pthread_create(&t1, NULL, producer, (void *)NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("\n");
    return 0;
}

compiled with:

gcc -std=c11 -pthread -O2 -Wall -Wextra -Werror -pedantic -pedantic-errors main.c -o main

thanks.

Follow-up:

There is a follow up to this question.

Solution

Looks OK to me with a single producer and a single consumer, since they can't do any work in parallel: each one spin-waits for the other after doing a release store that lets the other leave its loop. The loads are acquire, so yes, each release/acquire pair creates a happens-before relationship.

The consumer doesn't need to re-check bucket_full while counting bucket down toward 0. Once it sees bucket_full being non-zero and not -1 (your -1 check is a separate seq_cst load, since you didn't load into a temporary in the spin loop), you might as well have an inner loop like while(--bucket >= 1){ ... }, unless you want a third thread (e.g. the main thread) to be able to set bucket_full = -1 as an exit_now flag for the consumer.

The way you've written it, the consumer won't be able to keep the non-atomic bucket in a register while it's counting down from 12, since each acquire load potentially means it could have synchronized with another thread that stored to that variable. (That's probably not actually possible without causing a potential data race, but compilers are somewhat conservative about that. Also, calling a non-inline function like printf makes the compiler assume that all globals could have been modified, and bucket is a global with external linkage, not static.)
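The consumer-side points above can be sketched roughly like this: load the flag once into a temporary in the spin loop, test that temporary for -1, then drain the bucket in an inner loop without touching bucket_full again. This is only a sketch of the same one-bucket hand-off, not a drop-in replacement. The names run_handoff and units_consumed are made up for illustration, the printf calls are dropped, and the inner loop is written as while (bucket > 0) so it takes the same 12 steps per bucket as the original.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

enum { NUM_BUCKETS = 30, UNITS_PER_BUCKET = 12 };

static int bucket = 0;               // plain data, protected by the flag protocol
static _Atomic int bucket_full = 0;  // 0 = empty, 1 = full, -1 = no more buckets
static long units_consumed = 0;      // hypothetical counter, read only after join

static void *producer(void *arg) {
    for (int i = 0; i <= NUM_BUCKETS; ++i) {
        // acquire load: wait for the consumer's "bucket is empty" signal
        while (atomic_load_explicit(&bucket_full, memory_order_acquire) != 0)
            sched_yield();
        if (i == NUM_BUCKETS) {      // no more buckets to fill
            atomic_store_explicit(&bucket_full, -1, memory_order_release);
            break;
        }
        bucket = UNITS_PER_BUCKET;   // fill bucket
        atomic_store_explicit(&bucket_full, 1, memory_order_release);
    }
    return arg;
}

static void *consumer(void *arg) {
    for (;;) {
        int state;
        // one acquire load per spin; its result is reused for the -1 check
        while ((state = atomic_load_explicit(&bucket_full,
                                             memory_order_acquire)) == 0)
            sched_yield();
        if (state == -1)             // no separate seq_cst load needed
            break;
        // inner loop: no atomic operations, so bucket can live in a register
        while (bucket > 0) {
            --bucket;
            ++units_consumed;
        }
        atomic_store_explicit(&bucket_full, 0, memory_order_release);
    }
    return arg;
}

// Hypothetical driver: runs one producer/consumer pair, returns units consumed.
long run_handoff(void) {
    pthread_t c, p;
    bucket = 0;
    units_consumed = 0;
    atomic_store(&bucket_full, 0);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(c, NULL);
    pthread_join(p, NULL);
    return units_consumed;
}
```

Calling run_handoff() from main should consume NUM_BUCKETS * UNITS_PER_BUCKET units. Reading units_consumed after pthread_join is safe because joining a thread synchronizes with its termination.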

Of course, this whole algorithm is hopefully just a learning exercise. You aren't getting any parallelism between threads, since one is always spin-waiting while the other works, and there's no easy way to change that, except I guess to have the producer do some work before spin-waiting for bucket_full == 0, e.g. preparing a non-trivial value to put in the one bucket.

Single-producer single-consumer (SPSC) queues using an array exist and work well if that's what you want. They let both sides spend most of their CPU time on work rather than spinning, as long as they run at similar speeds.
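For reference, a minimal sketch of such an array-based SPSC queue might look like the following. It is one common way to do it, not the only one, and everything in it (spsc_queue, spsc_push, spsc_pop, QUEUE_CAPACITY, run_spsc_demo) is a name I made up for illustration. It assumes exactly one thread calls spsc_push and exactly one thread calls spsc_pop.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_CAPACITY 8u  // illustrative size; a power of two

typedef struct {
    int slots[QUEUE_CAPACITY];
    _Atomic size_t head;   // next index to read; written only by the consumer
    _Atomic size_t tail;   // next index to write; written only by the producer
} spsc_queue;

// Producer side: returns false if the queue is full.
static bool spsc_push(spsc_queue *q, int value) {
    size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    // acquire: the consumer has finished reading any slot it released
    size_t head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (tail - head == QUEUE_CAPACITY)
        return false;
    q->slots[tail % QUEUE_CAPACITY] = value;
    // release: publish the slot contents before the new tail becomes visible
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    return true;
}

// Consumer side: returns false if the queue is empty.
static bool spsc_pop(spsc_queue *q, int *out) {
    size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    // acquire: pairs with the producer's release store to tail
    size_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (head == tail)
        return false;
    *out = q->slots[head % QUEUE_CAPACITY];
    // release: hand the slot back to the producer
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return true;
}

static spsc_queue queue;
static int items_to_send;   // set before the threads start
static long consumer_sum;   // read only after the consumer is joined

static void *queue_producer(void *arg) {
    for (int i = 1; i <= items_to_send; ++i)
        while (!spsc_push(&queue, i))
            sched_yield();  // queue full: let the consumer run
    return arg;
}

static void *queue_consumer(void *arg) {
    long sum = 0;
    for (int received = 0; received < items_to_send; ) {
        int value;
        if (spsc_pop(&queue, &value)) {
            sum += value;
            ++received;
        } else {
            sched_yield();  // queue empty: let the producer run
        }
    }
    consumer_sum = sum;
    return arg;
}

// Hypothetical driver: sends 1..count through the queue, returns their sum.
long run_spsc_demo(int count) {
    pthread_t p, c;
    items_to_send = count;
    atomic_store(&queue.head, 0);
    atomic_store(&queue.tail, 0);
    pthread_create(&c, NULL, queue_consumer, NULL);
    pthread_create(&p, NULL, queue_producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return consumer_sum;
}
```

Unlike the one-bucket version, the producer only has to wait when the queue is completely full and the consumer only when it is completely empty, so with several slots both threads can usually make progress at the same time.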

If you're testing on x86 hardware, keep in mind that its memory model is quite strong: every ordinary asm load is implicitly an acquire load and every ordinary asm store is implicitly a release store. So in practice it's often only compile-time reordering that can produce the reorderings the C / C++ memory model allows. Building with clang -O2 -fsanitize=thread (ThreadSanitizer) can detect some data races at run time.