Tags: c++, c++17, double-checked-locking

Double-checked locking implementation - using atomic vs. not using


I saw this cpp core guideline and I'm trying to understand something:

http://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#cp111-use-a-conventional-pattern-if-you-really-need-double-checked-locking

Why is the first example marked as bad? Is it only because the variable is volatile? What harm can happen if the first check is not thread-safe, assuming the actual initialization is protected by the mutex? In the worst case we stumble into a locked mutex, but once it's unlocked we won't run the "only once" code.

Consider these 2 options:

#include <atomic>
#include <mutex>
#include <optional>
class A{};
A costly_init_function();
std::optional<A> a;

//option 1
std::once_flag a_init;
const A& foo() {
    if (!a) {
        std::call_once(a_init, [](){ a.emplace(costly_init_function()); });
    }
    return *a;
}

//option 2
std::atomic_bool a_init_flag = false;
std::mutex mutex;
const A& moo() {
    if (!a_init_flag) {
        std::lock_guard lock{mutex};
        if (!a_init_flag) {
            a.emplace(costly_init_function());
            a_init_flag = true;
        }
    }
    return *a;
}

Is there any actual issue that may happen in option 1? To me, it seems like the worst that could happen is that we access a in a non-thread-safe way, and as a result we wait for the call_once to finish and then simply skip to returning *a. Should I prefer option 2, which is more expensive but somewhat safer?

edit:

it seems people think that rephrasing my question and explaining it in more detail is actually an answer. I would have just deleted the question, but apparently I can't

so:

  1. yes, I know the first if (!a) is not thread-safe
  2. yes, I know the same result can be achieved by only calling std::call_once
  3. yes, I know what volatile means

trying to be more specific about what I'm asking:

  1. I don't want this function to use a mutex or whatever std::call_once is doing in order to lock, because it hurts my running time (mutex - a lot; call_once about 1/3 of the time the mutex takes, but still 7 times more than "option 1").
  2. I first implemented the first option, knowing the first check is not thread-safe, assuming the worst that can happen is that the race condition causes me to get into the call_once, which has the actual protection, and thus only some running time will be wasted on the first few calls to the function.
  3. somebody pointed me to the linked cpp core guideline, telling me I should probably use an atomic bool in order to also make everything thread-safe.
  4. this obviously has an additional running-time cost, but looking at the bad example it seemed to me that they are comparing apples to oranges, because they used volatile.
  5. I was unsure whether volatile is essential to make the code "bad", or whether using a regular storage variable may also cause an unwanted issue.
  6. came here to ask this; instead got a discussion about whether I should use a local static variable (see the sketch after this list), as if this is the actual code being run and not a simplified example.
  7. BTW, maybe the answer is simply 'every race condition is UB, even if you can "prove" that the execution flow is the same as in this example'.
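
For reference, the "local static" variant referred to in point 6 would look roughly like this (a sketch only; the function name boo is illustrative, and since C++11 the initialization of a function-local static is itself thread-safe):

// sketch of the local-static suggestion, not the actual code in question
const A& boo() {
    static A instance = costly_init_function();  // initialized exactly once, thread-safe since C++11
    return instance;
}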

Solution

  • Making the function thread-safe comes at a cost. Yes, example 1 is faster; that is because it is not doing the required job.

    There is a fundamental error in your thinking: that you can reason about concurrent accesses without atomics or synchronisation. That assumption is wrong. Any concurrent read and write of a non-atomic variable is UB.

    [intro.races]:

    The execution of a program contains a data race if it contains two potentially concurrent conflicting actions, at least one of which is not atomic, and neither happens before the other, except for the special case for signal handlers described below. Any such data race results in undefined behavior.

    Note that "happens before" has a precise definition. Simply running function A 5 minutes after function B does not validate as "B happens before A".

    In particular, a possible outcome here is that another thread sees a as engaged (so if (!a) is false) before it sees the result of the init function.

    The problem is not the volatile, it's the lack of sequencing. This cpp core guideline only highlights the fact that volatile does not offer sequencing guarantees, and as such the code exhibits the same undefined behavior as if the volatile keyword were not there.
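
    For completeness, the conventional pattern the guideline points at keeps even the fast-path check on the atomic. A sketch of your option 2 with the memory orders made explicit (the acquire/release choice is my reading of the guideline, not quoted from it):

    std::atomic<bool> a_init_flag{false};
    std::mutex mutex;

    const A& moo() {
        // fast path: the acquire load synchronises with the release store below,
        // so a thread that sees 'true' also sees the fully constructed A
        if (!a_init_flag.load(std::memory_order_acquire)) {
            std::lock_guard lock{mutex};
            // re-check under the mutex; relaxed is enough here because the mutex
            // already orders this load against the winning thread's store
            if (!a_init_flag.load(std::memory_order_relaxed)) {
                a.emplace(costly_init_function());
                a_init_flag.store(true, std::memory_order_release);
            }
        }
        return *a;
    }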

    In short:

    • With no synchronisation, there is 0 guarantee on anything.

    Eg:

    int x = 0;
    int y = 0;
    
    void f() {
        y = 1;
        x = 1;
    }
    

    Assuming at least one thread runs this function and you have no other synchronisation mechanism in place, a thread reading x and y can observe y == 0 and x == 1, even though y is written first.
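
    For example, a reader thread like this (just a sketch to make the claim concrete; the concurrent non-atomic accesses are exactly the data race quoted above):

    void g() {
        // data race if f() runs concurrently: x and y are plain ints;
        // one possible manifestation is seeing the write to x but not
        // the earlier write to y
        if (x == 1) {
            int observed_y = y;  // may still be 0
            (void)observed_y;
        }
    }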

    On a technical level:

    • the thread that calls f has no reason to flush its writes to memory, and if it does, it has no reason to do so in any particular order.
    • the thread that reads x and y has no reason to invalidate its cache and fetch the modified values from memory, and if it does, it has no reason to do so in any particular order (e.g. it might have y in cache but not x, so it loads x from memory and sees the updated value, but uses the stale cached y).
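
    The usual way to get the guarantee back is to give the two threads a synchronisation point, e.g. by making x atomic and pairing a release store with an acquire load (a sketch; the choice of orderings is mine):

    #include <atomic>
    #include <cassert>

    std::atomic<int> x{0};
    int y = 0;

    void writer() {
        y = 1;                                  // plain write
        x.store(1, std::memory_order_release);  // publishes the write to y
    }

    void reader() {
        if (x.load(std::memory_order_acquire) == 1) {
            // the acquire load synchronises with the release store,
            // so the write to y happens before this read
            assert(y == 1);
        }
    }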