Tags: c++, asynchronous, assembly, timer, lock-free

How do hardware timers work and how do they affect software performance?


I want to call functions asynchronously. I chose boost::deadline_timer.

For me, a hardware timer is a specific piece of hardware (surprisingly) that works independently of the CPU and whose only duty is keeping track of time. At the same time, if I understand correctly, it can also be used to set a timeout and generate an interrupt when the timeout is reached.

The primary advantage of that is asynchronous execution: the thread that sets a timer can continue working, and the callback function will be triggered in the same thread that set the timer.

Let me describe how I see it in action.

  1. The application contains one or more worker threads, e.g. they process input items and filter them. Let's say the application has 5 threads and each thread sets one timer (5 seconds).

  2. The application is running. E.g. the current thread is thread-3.

  3. The timer set by thread-0 expires and generates (probably the wrong term) an interrupt.

  4. A thread context switch occurs (thread-3 -> thread-0);

  5. The callback function executes;

  6. The timer set by thread-1 expires and generates an interrupt.

...

And so on

P.S.0. I understand that this is not the only possible design for a multi-threaded application.

Questions:

  1. Did I describe the process correctly?

  2. Do I understand correctly that even if the current thread is thread-0, a context switch still occurs, since the thread has to stop executing its current code and switch to executing the code of the callback function?

  3. If each thread sets 100k or 500k timers, how will that affect performance?

  4. Does the hardware have a limit on the number of timers?

  5. How expensive is it to update the timeout of a timer?


Solution

  • A hardware timer is, at its core, just a count-up counter and a set of comparators (or a count-down counter that uses the borrow of the MSb as an implicit comparison with 0).
    Picture it as a register with a specialized operation Increment (or Decrement) that is started at every cycle of a clock (the easiest kind of counter with this operation is the Ripple-counter).
    Each cycle the counter value is also fed to the comparator, previously loaded with a value, and its output will be the input to the CPU (as an interrupt or in a specialized pin).
    In the case of a count-down counter, the borrow from the MSb acts as the signal that the value rolled over zero.
    These timers usually have more functions, like the ability to stop after reaching the desired value (one-shot), to reset and restart (periodic), to alternate the output between low and high (square-wave generator), and other fancy features.

    There is no limit on how many timers you can put in a package; of course, although they are simple circuits, they still have a cost in terms of money and space.
    Most MCUs have one or two timers; when there are two, the idea is to use one for generic scheduling and the other for high-priority tasks orthogonal to the OS scheduling.
    It's worth noting that having many hardware timers (to be used by the software) is useless unless there are also many CPUs/MCUs, since otherwise it's easier to use software timers.
    On x86 the HPET timer is actually made of at most 32 timers, each with 8 comparators, for a total of 256 timers as seen from the software POV.
    The idea was to assign each timer to a specific application.

    Applications in an OS don't use the hardware timers directly, because there can be a lot of applications but just one or two timers.
    So what the OS does is share the timer.
    It does this by programming the timer to generate an interrupt every X units of time and by registering an ISR (Interrupt Service Routine) for such an event.
    When a thread/task/program sets up a timer, the OS appends the timer information (periodic vs one-shot, period, ticks left, and callback) to a priority queue using the absolute expiration time as the key (see Peter Cordes comments below) or a list for simple OSes.
    Each time the ISR is called, the OS will peek at the queue and see if the element on top has expired.

    What happens when a software timer is expired is OS-dependent.
    Some embedded and small OS may call the timer's callback directly from the context of the ISR.
    This is often true if the OS doesn't really have a concept of thread/task (and so of context switch).
    Other OSes may append the timer's callback to a list of "to be called soon" functions.
    This list will be walked and processed by a specialized task. This is how FreeRTOS does it if the timer task is enabled.
    This approach keeps the ISR short and allows programming the hardware timer with a shorter period (in many architectures interrupts are ignored while in an ISR, either by the CPU automatically masking interrupts or by the interrupt controller).
    IIRC Windows does something similar: it posts an APC (Asynchronous Procedure Call) in the context of the thread that set the software timer that just expired. When the thread is scheduled, the APC will call the callback (as a form of window message or not, depending on the specific API used). If the thread was waiting on the timer, I think it is just set to the ready state. In any case, it's not scheduled right away, but it may get a priority boost.

    Where the ISR will return is still OS-dependent.
    An OS may continue executing the interrupted thread/task until it's scheduled out. In this case, you don't have step 4 immediately after step 3; instead, thread-3 will run until its quantum expires.
    Conversely, an OS may signal the end of the ISR to the hardware and then schedule the thread with the callback.
    This approach doesn't work if two or more timers expire in the same tick, so a better approach is to trigger a rescheduling, letting the scheduler pick the most appropriate thread.
    The scheduling may also take into account other hints given by the thread during the creation of the software timer.
    The OS may also just switch context, execute the callback and get back to the ISR context where it continues peeking at the queue.
    The OS may even do any of that based on the period of the timer and other hints.

    So it works pretty much like you imagined, except that a thread may not be called immediately upon the timer's expiration.

    Updating a timer is not expensive.
    While the work per timer is small, the timer ISR is meant to be called many, many times a second.
    In fact, I'm not even sure an OS will allow you to create such a huge number (500k) of timers.
    Windows can manage a lot of timers (and their backing threads), but probably not 500k.

    The main problem with having a lot of timers is that even if each one performs little work, the total work performed may be too much to keep up with the rate of ticking.
    If 100 timers expire every X units of time (e.g. 1 ms), you have X/100 units of time (e.g. 10 us) to execute each callback, and a callback's code may just be too long to execute in that slice of time.
    When this happens, the callbacks will be called less often than desired.
    More CPUs/cores would allow some callbacks to execute in parallel and alleviate the pressure.

    In general, you need different timers only if they run at different rates; otherwise, a single timer that walks a data structure filled with work items is fine.
    Multi-threading can provide concurrency if your tasks are I/O-bound (files, network, input, and so on) or parallelism if you have a multi-processor system.