Writing uint32 to uint64 is not atomic. Why?

In "Is Parallel Programming Hard, And, If So, What Can You Do About It?", on page 410, it is written:

Quick Quiz 5.17:

Why doesn’t inc_count() in Listing 5.4 need to use atomic instructions?

Answer:
(..) atomic instructions would be needed in cases where the per-thread counter variables were smaller than the global global_count (..)

Simplifying, that sentence applies to the following example:

uint64_t global_count = 0;

void f(void) {
    uint32_t sum = sum_of_smaller_thread_locals(); /* sum is a variable */
    WRITE_ONCE(global_count, sum);
}

I cannot understand why we need atomic instructions in that case.

Solution

As Peter Cordes points out, the atomic instructions would be required for the per-thread increments. The reason is given in the text, but the superfluous 'however' clouds it slightly:

That said, atomic instructions would be needed in cases where the per-thread counter variables were smaller than the global global_count. ~~However,~~ note that on a 32-bit system, the per-thread counter variables might need to be limited to 32 bits in order to sum them accurately, but with a 64-bit global_count variable to avoid overflow. In this case, it is necessary to zero the per-thread counter variables periodically in order to avoid overflow. It is extremely important to note that this zeroing cannot be delayed too long or overflow of the smaller per-thread variables will result. This approach therefore imposes real-time requirements on the underlying system, and in turn must be used with extreme care.

In contrast, if all variables are the same size, overflow of any variable is harmless because the eventual sum will be modulo the word size.
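To make the modulo argument concrete, here is a minimal sketch in C. The helper names (bump, sum_same_width, sum_wider) are made up for illustration and are not from the book's Listing 5.4. It only shows the arithmetic, with no threads involved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: apply n increments to one per-thread counter.
 * Unsigned arithmetic wraps modulo 2^32, which is well defined in C. */
static void bump(uint32_t *counter, uint32_t n)
{
    assert(counter != NULL);
    *counter += n;
}

/* Total has the same width as the counters: the sum is computed
 * modulo 2^32, so a counter that wrapped still contributes correctly. */
static uint32_t sum_same_width(const uint32_t counters[], size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += counters[i];
    return sum;
}

/* Total is wider than the counters: a wrap in a 32-bit counter has
 * already discarded 2^32 counts, and the 64-bit sum cannot recover them. */
static uint64_t sum_wider(const uint32_t counters[], size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += counters[i];
    return sum;
}
```

Suppose one counter sits at UINT32_MAX - 2 and receives 8 more increments, so it wraps to 5, while a second counter holds 10. Both functions return 15. For the 32-bit total that is the exact answer modulo 2^32. For the 64-bit total it is short by 2^32, which is why narrower per-thread counters have to be zeroed before they can wrap.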

If the main thread clears the per-thread counters, it needs to do this via an atomic exchange to avoid possible data loss. If the per-thread increments do the clearing, to avoid data loss they would need some other (likely more complex) kind of interlock.
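The first option, where the main thread clears the counters, could be sketched roughly as follows using C11 <stdatomic.h>. This is an assumption-laden illustration, not the book's code: the variable and function names are hypothetical, a single per-thread counter stands in for the whole set, and relaxed ordering is assumed to be enough because only the count itself matters.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical names, for illustration only. */
static _Atomic uint32_t per_thread_counter; /* one thread's narrow counter */
static uint64_t global_count;               /* wide total, touched only by the collector */

/* Updater thread: the collector may zero the counter at any moment,
 * so the increment must be an atomic read-modify-write. */
static void inc_count(void)
{
    atomic_fetch_add_explicit(&per_thread_counter, 1, memory_order_relaxed);
}

/* Collector (main thread): fetch and zero in one atomic step.
 * A separate load followed by a store of 0 could lose any increment
 * that lands between the two operations. */
static uint64_t harvest(void)
{
    uint32_t snapshot = atomic_exchange_explicit(&per_thread_counter, 0,
                                                 memory_order_relaxed);
    global_count += snapshot;
    return global_count;
}
```

After three calls to inc_count(), harvest() returns 3 and leaves the per-thread counter at 0. Any increment that arrives later is picked up by the next harvest() rather than being overwritten by the clear.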