multithreading assembly x86 synchronization locking

Does UMWAIT make the process do REP NOP or context switch immediately?

Does calling UMWAIT make the process do the equivalent of REP NOP (that is, keep its hardware thread without being descheduled, but use less power by not issuing uops to the processor back end) until its time slice is over? Or does it cause the process to be switched out right away via a context switch?

Solution

It's the first one: umwait (the user-mode version of mwait, with a limit on how deep a sleep it can enter) is basically like pause (encoded as rep nop, which is how it executes on ancient CPUs that don't recognize it as a pause instruction). The thread keeps its hardware thread and is not context-switched out.

It doesn't make a yield() system call or otherwise trap to the OS. The same goes for mwait in kernel mode: it puts the CPU core to sleep rather than trapping. Kernels use it to put the CPU into a C-state until the next interrupt. (I think it was originally designed for actually waiting for memory writes from another core, but now one of its primary purposes is as an API that includes a sleep level, unlike hlt, so it's how CPUs expose deep sleep levels. The waiting-for-memory use case is still supported, too.)
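To make the "sleeps instead of trapping" point concrete, here is a minimal sketch of a user-space wait loop built on umonitor/umwait, with a plain pause fallback. It assumes x86-64 with GCC or Clang and the WAITPKG intrinsics from immintrin.h. The function names (cpu_has_waitpkg, wait_once_umwait, wait_for_flag) and the 100000-cycle TSC budget are made up for illustration and are not from any library. Nothing in it makes a system call; the OS only gets involved if an interrupt arrives.

```c
#include <cpuid.h>      /* __get_cpuid_count (GCC/Clang) */
#include <immintrin.h>  /* _umonitor, _umwait, _mm_pause */
#include <x86intrin.h>  /* __rdtsc */
#include <stdatomic.h>

/* CPUID.(EAX=7,ECX=0):ECX bit 5 reports WAITPKG (umonitor/umwait/tpause). */
static int cpu_has_waitpkg(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    return (ecx >> 5) & 1;
}

/* One bounded wait (hypothetical helper).
 * ctrl bit 0: 0 requests C0.2 (deeper), 1 requests C0.1 (faster wake-up). */
__attribute__((target("waitpkg")))
static void wait_once_umwait(atomic_int *flag, unsigned int ctrl,
                             unsigned long long tsc_budget)
{
    _umonitor((void *)flag);  /* arm the address monitor on flag's cache line */

    /* Re-check after arming so a store that landed in between is not missed. */
    if (atomic_load_explicit(flag, memory_order_acquire) != 0)
        return;

    /* Waits in user mode until the monitored line is written, an interrupt
     * arrives, the TSC deadline passes, or the OS-imposed limit expires.
     * The return value (carry flag) is ignored; the caller just re-checks. */
    (void)_umwait(ctrl, __rdtsc() + tsc_budget);
}

/* Spin until *flag becomes non-zero, without ever entering the kernel. */
static void wait_for_flag(atomic_int *flag)
{
    const int use_umwait = cpu_has_waitpkg();

    while (atomic_load_explicit(flag, memory_order_acquire) == 0) {
        if (use_umwait)
            wait_once_umwait(flag, 0, 100000);  /* arbitrary budget */
        else
            _mm_pause();                        /* rep nop on old CPUs */
    }
}
```

Another thread would release the waiter with something like atomic_store_explicit(&flag, 1, memory_order_release). The loop re-checks the flag after every wake-up because umwait can return for reasons other than the store you care about.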

If it just trapped so the OS could context switch, it wouldn't need to exist. int or syscall instructions already exist. Or in a kernel, a simple call schedule would potentially context-switch.

UMWAIT will put the core into C0.2/C0.1 state to save power. ... if the other SMT thread is active, most of backend/frontend will be active to C0.0, and if the other SMT thread is not active, then it will probably go into C1~ state.

Yeah, if the other logical core is still active, the physical core should keep running. (And maybe switch back to "single-threaded" mode, allowing the other logical core to use the full ROB and store buffer, and similarly un-partitioning any other statically-partitioned resources. Check perf stat -e cpu_clk_unhalted.one_thread_active against the case where the other thread is fully idle.)

I don't know the details of what sleep levels real microarchitectures actually have in practice, or how the on-paper sleep levels map to them. It might be a shallower sleep than regular C1 if C1 doesn't have a low enough wake-up latency, since some OSes would definitely want to stop user-space from doing anything with too high a wake-up latency, in order to meet the realtime guarantees they want to provide.