While investigating a critical path in our Python codebase, we found that the latency behaviour of ctypes is quite unpredictable.
A bit more background on our application: we have a bunch of processes, each of which communicates through shared memory. We use the Python classes multiprocessing.RawValue and multiprocessing.RawArray, which internally use ctypes for data management. While running this in production, we saw that even a simple get() access on these shared data types takes around 30-50 µs, and sometimes 100 µs, which is quite slow, even for Python.
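For context, a minimal sketch of the kind of sharing described above (not our actual production code) looks like this:

```python
# Minimal illustration (not our real code): a parent and a child process
# sharing a RawValue and a RawArray backed by the same shared memory.
import ctypes
import multiprocessing as mp

def reader(val, arr):
    # The child sees the same underlying ctypes objects as the parent.
    print(val.value, arr[:3])

if __name__ == "__main__":
    val = mp.RawValue(ctypes.c_int, 42)       # a single shared c_int
    arr = mp.RawArray(ctypes.c_double, 8)     # a shared array of 8 c_double
    arr[0], arr[1], arr[2] = 1.0, 2.0, 3.0
    p = mp.Process(target=reader, args=(val, arr))
    p.start()
    p.join()
```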
I have created this bare-bones example, which creates a ctypes structure and exposes a get() method:
import ctypes
import sys
import time
import numpy as np
import random
from decimal import Decimal

def get_time_ns():
    return Decimal(str(time.time_ns()))

class Point(ctypes.Structure):
    _fields_ = [("x", ctypes.c_int),
                ("y", ctypes.c_int)]

    def __init__(self, x, y):
        return super().__init__(x, y)

    def get(self):
        return self.x
        #return str(self.x) + "," + str(self.y)

def benchmark(delay_mode):
    p = Point(10, 20)
    iters = 10
    while iters:
        start_ts = get_time_ns()
        _ = p.get()
        end_ts = get_time_ns()
        print("Time: {} ns".format(end_ts - start_ts))
        iters -= 1
        if delay_mode == 1:
            time.sleep(random.uniform(0, 0.1))

benchmark(int(sys.argv[1]))
When I run this in no-sleep mode, the latency numbers are as follows:
[root@centos-s-4vcpu-8gb-fra1-01 experiments]# python3.9 simple_ctype.py 0
Time: 9556 ns
Time: 2246 ns
Time: 1124 ns
Time: 1174 ns
Time: 1091 ns
Time: 1126 ns
Time: 1081 ns
Time: 1066 ns
Time: 1077 ns
Time: 1138 ns
And when I run this in sleep mode, the latency numbers are as follows:
[root@centos-s-4vcpu-8gb-fra1-01 experiments]# python3.9 simple_ctype.py 1
Time: 27233 ns
Time: 27592 ns
Time: 31687 ns
Time: 32817 ns
Time: 26234 ns
Time: 32651 ns
Time: 29468 ns
Time: 36981 ns
Time: 31313 ns
Time: 34667 ns
The reason for using sleep is to simulate our production environment, where the application is doing more than just running this loop.

Can someone explain the reason for this 10-20x increase in latency when there are interruptions, compared to the hot loop above? My best guess is CPU cache misses, but that still does not explain such a large increase. I am also quite confused about how ctypes actually manages memory: is it just plain malloc, or mmap plus malloc? And last but not least, it would be really great if someone could help us optimise this.
System information: CentOS 7.9, 4-core CPU, 16 GB RAM. taskset is used to pin the script to a specific CPU core.

FYI, we already know that C++/Rust is better suited for this kind of high-precision performance work than a high-level language like Python, but considering the time sensitivity and other business reasons, we would like to optimise our Python code before we actually hit the language barrier.
There are multiple reasons for code to be slower after sleeping. Here, the four main ones are frequency scaling, TLB misses, cache misses and branch misses. All of them are due to context switches combined with a long period of CPU inactivity. The problem is independent of ctypes.
When a mainstream modern processor does not have an intensive task to compute, it automatically reduces its frequency (with the agreement of the operating system, which can be configured). It is similar to human sleep: when you have nothing to do, you can sleep, and when you wake up, it takes some time before you can operate quickly again (i.e. a dizzy state). It is the same for the processor: it takes some time to switch from a low frequency (used during the sleep call) to a high frequency (used while running the computational code). AFAIK, this is mainly because the processor needs to adapt its voltage. This is expected behaviour, because it is not energy efficient to switch to the highest frequency directly: the target code might not run for long (see hysteresis), and power consumption grows roughly with frequency**3 (due to the voltage increase required by higher frequencies).
There is a way to check this easily on Linux: use a fixed frequency and disable any turbo-like mode. On my i5-9600KF processor I used the following lines:
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
You can check the state of your CPU using the following lines:
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq
cat /proc/cpuinfo | grep MHz # Current frequency for each core
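If you prefer to watch this from Python, here is a small sketch (my own illustration, assuming a Linux machine exposing cpufreq's scaling_cur_freq; the exact sysfs path can differ between drivers) that reads the current core frequency after a busy loop and again after a sleep:

```python
# Sketch: observe the frequency dropping during idle periods (Linux cpufreq only).
# Pin the script (e.g. with taskset) so it actually runs on the core being read.
import time

def cur_freq_khz(cpu=0):
    # scaling_cur_freq reports the current frequency of the given core in kHz.
    with open(f"/sys/devices/system/cpu/cpu{cpu}/cpufreq/scaling_cur_freq") as f:
        return int(f.read())

end = time.time() + 0.5
while time.time() < end:      # busy-spin so the core ramps up
    pass
print("after busy loop:", cur_freq_khz(), "kHz")

time.sleep(1.0)               # idle long enough for the governor to clock down
print("right after sleep:", cur_freq_khz(), "kHz")
```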
Here are the results before and after the changes on my machine:
# ---------- BEFORE ----------
$ python3 lat-test.py 0
Time: 12387 ns
Time: 2023 ns
Time: 1272 ns
Time: 1096 ns
Time: 1070 ns
Time: 998 ns
Time: 1022 ns
Time: 956 ns
Time: 1002 ns
Time: 1378 ns
$ python3 lat-test.py 1
Time: 6941 ns
Time: 3772 ns
Time: 3544 ns
Time: 9502 ns
Time: 25475 ns
Time: 18734 ns
Time: 23800 ns
Time: 9503 ns
Time: 19520 ns
Time: 17306 ns
# ---------- AFTER ----------
$ python3 lat-test.py 0
Time: 7512 ns
Time: 2262 ns
Time: 1488 ns
Time: 1441 ns
Time: 1413 ns
Time: 1434 ns
Time: 1426 ns
Time: 1424 ns
Time: 1443 ns
Time: 1444 ns
$ python3 lat-test.py 1
Time: 8659 ns
Time: 5133 ns
Time: 3720 ns
Time: 4057 ns
Time: 3888 ns
Time: 4187 ns
Time: 4136 ns
Time: 3922 ns
Time: 4456 ns
Time: 3946 ns
We can see that the gap is significantly smaller. Moreover, the results are much more stable (and reproducible). Note that the absolute performance is reduced when the latency is small because turbo boost has been disabled (so my processor does not run at its highest possible frequency). On my machine the factor between the minimum frequency (0.8 GHz) and the maximum one (4.6 GHz) is 5.75, which is pretty big and justifies a significant part of the performance gap when frequency scaling is enabled (the default).
A significant part of the latency is lost in the execution of get_time_ns. This is a critical point: CPython is a slow interpreter, so you cannot measure time very precisely with it. An empty function call in CPython takes ~45 ns on my machine! The expression Decimal(str(1676949210508126547)) takes ~250 ns. It is critical to consider this, since you are measuring a latency only about 10 times bigger than this, in a context where such code can be significantly slower due to many overheads (including the cache being cold -- see below).
To improve the accuracy of the benchmark, I removed the use of the Decimal module and the expensive string conversions, and just used integers. Note that even basic integers are far from cheap in CPython, because they have a variable length and are dynamically allocated, not to mention that CPython interprets the bytecode at runtime. A simple integer_1 - integer_2 takes ~35 ns on my machine, while it should take less than 1 ns in native compiled code. Fetching the function time_ns from the time module also takes about the same time, not to mention that the timing function itself takes ~50 ns to execute (so ~85 ns in total for one fetch + execution).
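These overheads are easy to reproduce with timeit if you want to check the numbers on your own machine (a sketch; the exact values will differ between machines and CPython versions):

```python
# Rough reproduction of the micro-overheads mentioned above.
import time
import timeit
from decimal import Decimal

def empty():
    pass

N = 1_000_000
# timeit returns the total time in seconds for N runs, so "* 1000" gives ns per call.
print("empty call      :", timeit.timeit(empty, number=N) * 1000, "ns")
print("Decimal(str(x)) :", timeit.timeit(lambda: Decimal(str(1676949210508126547)),
                                         number=N) * 1000, "ns")
print("int subtraction :", timeit.timeit("a - b",
                                         globals={"a": 1676949210508126547, "b": 123456789},
                                         number=N) * 1000, "ns")
print("time.time_ns()  :", timeit.timeit(time.time_ns, number=N) * 1000, "ns")
```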
I also increased the number of iterations from 10 to 10_000 so that the effect is more visible during the following profiling part.
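The modified benchmark looks roughly like this (a simplified sketch rather than the exact script used for the numbers below):

```python
# Simplified sketch of the modified benchmark: raw integer nanoseconds from
# time.time_ns, the timer function fetched only once, no Decimal/str
# conversions, and 10_000 iterations.
import ctypes
import random
import sys
import time

class Point(ctypes.Structure):
    _fields_ = [("x", ctypes.c_int), ("y", ctypes.c_int)]
    def get(self):
        return self.x

def benchmark(delay_mode, iters=10_000):
    time_ns = time.time_ns           # fetch the timer function once
    p = Point(10, 20)
    samples = []
    while iters:
        start = time_ns()
        _ = p.get()
        end = time_ns()
        samples.append(end - start)  # plain integer subtraction
        iters -= 1
        if delay_mode == 1:
            time.sleep(random.uniform(0, 0.1))
    print("min: {} ns   max: {} ns".format(min(samples), max(samples)))

benchmark(int(sys.argv[1]))
```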
In the end, the latency was reduced to 200/1300 ns instead of 1400/4000. This is a huge difference. In fact, 200 ns is so small that at least half of the measured time is still timing overhead and not the call to p.get()! That being said, the gap is still there.
Part of the remaining overhead is due to cache misses and TLB misses. Indeed, when a context switch happens (due to the call to sleep), the CPU caches can be flushed. In fact, AFAIK, they are indirectly flushed on mainstream modern processors during a context switch: the TLB, a CPU cache responsible for translating virtual memory addresses to physical ones, is flushed, causing cache lines to be reloaded when the thread is scheduled back. This has a significant impact on performance after the process is scheduled back, because data typically needs to be reloaded from the slow RAM, or at least from a higher-latency cache (e.g. the LLC). Note that even if this were not the case, the thread could be scheduled back on a different core with its own private TLB unit, which would also result in many cache misses.

Depending on how memory is shared between processes, you may also experience a "TLB shootdown", which is also pretty expensive. See this post and this one for more information about this effect.
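As an aside (this does not remove the TLB/cache effects, but it avoids the additional misses caused by migrating to another core), the process can be pinned from Python itself, the in-code equivalent of the taskset pinning mentioned in the question. A Linux-only sketch:

```python
# Pin the current process to one core so it is always rescheduled on the same
# core (Linux-only; core 2 is an arbitrary choice for the example).
import os

os.sched_setaffinity(0, {2})      # 0 means "the calling process"
print(os.sched_getaffinity(0))    # -> {2}
```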
On Linux, we can use the great perf tool to track the CPU's performance events. Here are the results for the two use cases, for the TLBs:
# Low latency use-case
84 429 dTLB-load-misses # 0,02% of all dTLB cache hits
467 241 669 dTLB-loads
412 744 dTLB-store-misses
263 837 789 dTLB-stores
47 541 iTLB-load-misses # 39,53% of all iTLB cache hits
120 254 iTLB-loads
70 332 mem_inst_retired.stlb_miss_loads
8 435 mem_inst_retired.stlb_miss_stores
# High latency use-case
1 086 543 dTLB-load-misses # 0,19% of all dTLB cache hits
566 052 409 dTLB-loads
598 092 dTLB-store-misses
321 672 512 dTLB-stores
834 482 iTLB-load-misses # 443,76% of all iTLB cache hits
188 049 iTLB-loads
986 330 mem_inst_retired.stlb_miss_loads
93 237 mem_inst_retired.stlb_miss_stores
The dTLB is a per-core TLB used for storing the mapping of data pages. The iTLB is a per-core TLB used for storing the mapping of code pages. The sTLB is a bigger second-level TLB. We can see a huge increase in the number of dTLB load misses and iTLB load misses, as well as in the number of loads/stores missing the sTLB. This confirms that the performance issue is likely caused by TLB misses.
TLB misses should result in more cache misses, decreasing performance, and this is what we see in practice. Indeed, here are the performance results for the caches:
# Low latency use-case
463 214 319 mem_load_retired.l1_hit
4 184 769 mem_load_retired.l1_miss
2 527 800 mem_load_retired.l2_hit
1 659 963 mem_load_retired.l2_miss
1 568 506 mem_load_retired.l3_hit
96 549 mem_load_retired.l3_miss
# High latency use-case
558 344 514 mem_load_retired.l1_hit
7 280 721 mem_load_retired.l1_miss
3 564 001 mem_load_retired.l2_hit
3 720 610 mem_load_retired.l2_miss
3 547 260 mem_load_retired.l3_hit
105 502 mem_load_retired.l3_miss
Another part of the overhead is due to conditional jumps being predicted less well after a long sleep. This is a complex topic, but one should know that branches are predicted by mainstream modern processors based on many parameters, including past outcomes. For example, if a condition is always true, the processor can speculatively execute the code after the condition and revert it later if the prediction turns out to be wrong (which is expensive). Modern processors cannot predict a large set of conditional jumps simultaneously: they have a small cache for this, and it can quickly get flushed over time. The thing is, CPython does a lot of conditional jumps, like most interpreters. Thus, a context switch can likely cause a flush of the branch prediction cache, increasing the overhead of the conditional jumps used in this case and resulting in a higher latency.
Here are experimental results on my machine:
# Low latency use-case
350 582 368 branch-instructions
4 629 149 branch-misses # 1,32% of all branches
# High latency use-case
421 207 541 branch-instructions
8 392 124 branch-misses # 1,99% of all branches
Note that a branch miss should take ~14 cycles on my machine. With roughly 3.8 million additional branch misses, this means a gap of ~14 ms, and so ~1400 ns per iteration over the 10_000 iterations. That being said, only a fraction of this time is measured between the two calls to get_time_ns().
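As a back-of-the-envelope check of that estimate (my arithmetic, assuming a ~3.7 GHz clock, i.e. roughly the base frequency with turbo disabled, and the 10_000 iterations of the modified benchmark):

```python
# Rough estimate of the time lost to the additional branch misses.
extra_misses = 8_392_124 - 4_629_149          # ~3.8 million additional misses
cycles = extra_misses * 14                     # ~14 cycles per branch miss
seconds = cycles / 3.7e9                       # assuming a ~3.7 GHz clock
print(round(seconds * 1e3, 1), "ms in total")              # ~14 ms
print(round(seconds * 1e9 / 10_000), "ns per iteration")   # ~1400 ns
```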
For more information about this topic, please read this post.