While investigating a critical path in our Python codebase, we found that the latency behaviour of ctypes is quite unpredictable.
A bit more background on our application: we have a bunch of processes, each of which communicates through shared memory. We use the Python classes multiprocessing.RawValue and multiprocessing.RawArray, which internally use ctypes for data management. While running this in production, we saw that even a simple get() access on these shared data types takes around 30-50 µs, and sometimes 100 µs, which is quite slow, even for Python.
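For context, a minimal sketch of the kind of sharing described above (not our actual production code) looks like this:

```python
# Minimal illustration (not our real code): a parent and a child process
# sharing a RawValue and a RawArray backed by the same shared memory.
import ctypes
import multiprocessing as mp

def reader(val, arr):
    # The child sees the same underlying ctypes objects as the parent.
    print(val.value, arr[:3])

if __name__ == "__main__":
    val = mp.RawValue(ctypes.c_int, 42)       # a single shared c_int
    arr = mp.RawArray(ctypes.c_double, 8)     # a shared array of 8 c_double
    arr[0], arr[1], arr[2] = 1.0, 2.0, 3.0
    p = mp.Process(target=reader, args=(val, arr))
    p.start()
    p.join()
```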
I have created this bare-bones example, which creates a ctypes structure and exposes a get() method:
import ctypes
import sys
import time
import numpy as np
import random
from decimal import Decimal

def get_time_ns():
    return Decimal(str(time.time_ns()))

class Point(ctypes.Structure):
    _fields_ = [("x", ctypes.c_int),
                ("y", ctypes.c_int)]

    def __init__(self, x, y):
        return super().__init__(x, y)

    def get(self):
        return self.x
        #return str(self.x) + "," + str(self.y)

def benchmark(delay_mode):
    p = Point(10, 20)
    iters = 10
    while iters:
        start_ts = get_time_ns()
        _ = p.get()
        end_ts = get_time_ns()
        print("Time: {} ns".format(end_ts - start_ts))
        iters -= 1
        if delay_mode == 1:
            time.sleep(random.uniform(0, 0.1))

benchmark(int(sys.argv[1]))
When I run this in no-sleep mode, the latency numbers are as follows:
[root@centos-s-4vcpu-8gb-fra1-01 experiments]# python3.9 simple_ctype.py 0
Time: 9556 ns
Time: 2246 ns
Time: 1124 ns
Time: 1174 ns
Time: 1091 ns
Time: 1126 ns
Time: 1081 ns
Time: 1066 ns
Time: 1077 ns
Time: 1138 ns
And when I run this in sleep mode, the latency numbers are as follows:
[root@centos-s-4vcpu-8gb-fra1-01 experiments]# python3.9 simple_ctype.py 1
Time: 27233 ns
Time: 27592 ns
Time: 31687 ns
Time: 32817 ns
Time: 26234 ns
Time: 32651 ns
Time: 29468 ns
Time: 36981 ns
Time: 31313 ns
Time: 34667 ns
The reason for using sleep is to simulate our production environment, where the application is doing more than just running this loop.

Can someone explain the reason for this 10-20x increase in latency when there are interruptions, compared to the hot loop above? My best guess is CPU cache misses, but that still does not explain such a large increase. I am also quite confused about how ctypes actually manages memory: is it just plain malloc, or mmap plus malloc? And last but not least, it would be really great if someone could help us optimise this.
System information: CentOS 7.9, 4-core CPU, 16 GB RAM. taskset is used to pin the script to a specific CPU core.

FYI, we already know that C++/Rust is better suited for this kind of high-precision performance work than a high-level language like Python, but considering the time sensitivity and other business reasons, we would like to optimise our Python code before we actually hit the language barrier.
There are multiple reasons for code to be slower after sleeping. Here, the four main ones are frequency scaling, TLB misses, cache misses and branch misses. All of them are due to context switches combined with a long period of CPU inactivity. The problem is independent of ctypes.
When a mainstream modern processor does not have an intensive task to compute, it automatically reduces its frequency (with the agreement of the operating system, which can be configured). It is similar to human sleep: when you have nothing to do, you can sleep, and when you wake up, it takes some time before you can operate quickly again (i.e. a dizzy state). It is the same for the processor: it takes some time to switch from a low frequency (used during the sleep call) to a high frequency (used while running the computational code). AFAIK, this is mainly because the processor needs to adapt its voltage. This is expected behaviour, because it is not energy efficient to switch to the highest frequency directly: the target code might not run for long (see hysteresis), and power consumption grows roughly with frequency**3 (due to the voltage increase required by higher frequencies).
There is a way to check this easily on Linux: use a fixed frequency and disable any turbo-like mode. On my i5-9600KF processor I used the following lines:
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
You can check the state of your CPU using the following lines:
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq
cat /proc/cpuinfo | grep MHz # Current frequency for each core
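If you prefer to watch this from Python, here is a small sketch (my own illustration, assuming a Linux machine exposing cpufreq's scaling_cur_freq; the exact sysfs path can differ between drivers) that reads the current core frequency after a busy loop and again after a sleep:

```python
# Sketch: observe the frequency dropping during idle periods (Linux cpufreq only).
# Pin the script (e.g. with taskset) so it actually runs on the core being read.
import time

def cur_freq_khz(cpu=0):
    # scaling_cur_freq reports the current frequency of the given core in kHz.
    with open(f"/sys/devices/system/cpu/cpu{cpu}/cpufreq/scaling_cur_freq") as f:
        return int(f.read())

end = time.time() + 0.5
while time.time() < end:      # busy-spin so the core ramps up
    pass
print("after busy loop:", cur_freq_khz(), "kHz")

time.sleep(1.0)               # idle long enough for the governor to clock down
print("right after sleep:", cur_freq_khz(), "kHz")
```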
Here are the results before and after the changes on my machine:
# ---------- BEFORE ----------
$ python3 lat-test.py 0
Time: 12387 ns
Time: 2023 ns
Time: 1272 ns
Time: 1096 ns
Time: 1070 ns
Time: 998 ns
Time: 1022 ns
Time: 956 ns
Time: 1002 ns
Time: 1378 ns
$ python3 lat-test.py 1
Time: 6941 ns
Time: 3772 ns
Time: 3544 ns
Time: 9502 ns
Time: 25475 ns
Time: 18734 ns
Time: 23800 ns
Time: 9503 ns
Time: 19520 ns
Time: 17306 ns
# ---------- AFTER ----------
$ python3 lat-test.py 0
Time: 7512 ns
Time: 2262 ns
Time: 1488 ns
Time: 1441 ns
Time: 1413 ns
Time: 1434 ns
Time: 1426 ns
Time: 1424 ns
Time: 1443 ns
Time: 1444 ns
$ python3 lat-test.py 1
Time: 8659 ns
Time: 5133 ns
Time: 3720 ns
Time: 4057 ns
Time: 3888 ns
Time: 4187 ns
Time: 4136 ns
Time: 3922 ns
Time: 4456 ns
Time: 3946 ns
We can see that the gap is significantly smaller. Moreover, the results are much more stable (and reproducible). Note that the absolute performance is reduced when the latency is small because turbo boost has been disabled (so my processor does not run at its highest possible frequency). On my machine the factor between the minimum frequency (0.8 GHz) and the maximum one (4.6 GHz) is 5.75, which is pretty big and justifies a significant part of the performance gap when frequency scaling is enabled (the default).
A significant part of the latency is lost in the execution of get_time_ns. This is a critical point: CPython is a slow interpreter, so you cannot measure time very precisely with it. An empty function call in CPython takes ~45 ns on my machine! The expression Decimal(str(1676949210508126547)) takes ~250 ns. It is critical to consider this, since you are measuring a latency only about 10 times bigger than this, in a context where such code can be significantly slower due to many overheads (including the cache being cold -- see below).
To improve the accuracy of the benchmark, I removed the use of the Decimal module and the expensive string conversions, and just used integers. Note that even basic integers are far from cheap in CPython, because they have a variable length and are dynamically allocated, not to mention that CPython interprets the bytecode at runtime. A simple integer_1 - integer_2 takes ~35 ns on my machine, while it should take less than 1 ns in native compiled code. Fetching the function time_ns from the time module also takes about the same time, not to mention that the timing function itself takes ~50 ns to execute (so ~85 ns in total for one fetch + execution).
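These overheads are easy to reproduce with timeit if you want to check the numbers on your own machine (a sketch; the exact values will differ between machines and CPython versions):

```python
# Rough reproduction of the micro-overheads mentioned above.
import time
import timeit
from decimal import Decimal

def empty():
    pass

N = 1_000_000
# timeit returns the total time in seconds for N runs, so "* 1000" gives ns per call.
print("empty call      :", timeit.timeit(empty, number=N) * 1000, "ns")
print("Decimal(str(x)) :", timeit.timeit(lambda: Decimal(str(1676949210508126547)),
                                         number=N) * 1000, "ns")
print("int subtraction :", timeit.timeit("a - b",
                                         globals={"a": 1676949210508126547, "b": 123456789},
                                         number=N) * 1000, "ns")
print("time.time_ns()  :", timeit.timeit(time.time_ns, number=N) * 1000, "ns")
```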
I also increased the number of iterations from 10 to 10_000 so that the effect is more visible during the following profiling part.
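The modified benchmark looks roughly like this (a simplified sketch rather than the exact script used for the numbers below):

```python
# Simplified sketch of the modified benchmark: raw integer nanoseconds from
# time.time_ns, the timer function fetched only once, no Decimal/str
# conversions, and 10_000 iterations.
import ctypes
import random
import sys
import time

class Point(ctypes.Structure):
    _fields_ = [("x", ctypes.c_int), ("y", ctypes.c_int)]
    def get(self):
        return self.x

def benchmark(delay_mode, iters=10_000):
    time_ns = time.time_ns           # fetch the timer function once
    p = Point(10, 20)
    samples = []
    while iters:
        start = time_ns()
        _ = p.get()
        end = time_ns()
        samples.append(end - start)  # plain integer subtraction
        iters -= 1
        if delay_mode == 1:
            time.sleep(random.uniform(0, 0.1))
    print("min: {} ns   max: {} ns".format(min(samples), max(samples)))

benchmark(int(sys.argv[1]))
```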
In the end, the latency was reduced to 200/1300 ns instead of 1400/4000. This is a huge difference. In fact, 200 ns is so small that at least half of the measured time is still timing overhead and not the call to p.get()! That being said, the gap is still there.
Part of the remaining overhead is due to cache misses and TLB misses. Indeed, when a context switch happens (due to the call to sleep), the CPU caches can be flushed. In fact, AFAIK, they are indirectly flushed on mainstream modern processors during a context switch: the TLB, a CPU cache responsible for translating virtual memory addresses to physical ones, is flushed, causing cache lines to be reloaded when the thread is scheduled back. This has a significant impact on performance after the process is scheduled back, because data typically needs to be reloaded from the slow RAM, or at least from a higher-latency cache (e.g. the LLC). Note that even if this were not the case, the thread could be scheduled back on a different core with its own private TLB unit, which would also result in many cache misses.

Depending on how memory is shared between processes, you may also experience a "TLB shootdown", which is also pretty expensive. See this post and this one for more information about this effect.
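As an aside (this does not remove the TLB/cache effects, but it avoids the additional misses caused by migrating to another core), the process can be pinned from Python itself, the in-code equivalent of the taskset pinning mentioned in the question. A Linux-only sketch:

```python
# Pin the current process to one core so it is always rescheduled on the same
# core (Linux-only; core 2 is an arbitrary choice for the example).
import os

os.sched_setaffinity(0, {2})      # 0 means "the calling process"
print(os.sched_getaffinity(0))    # -> {2}
```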
On Linux, we can use the great perf tool to track the CPU's performance events. Here are the results for the two use cases, for the TLBs:
# Low latency use-case
84 429 dTLB-load-misses # 0,02% of all dTLB cache hits
467 241 669 dTLB-loads
412 744 dTLB-store-misses
263 837 789 dTLB-stores
47 541 iTLB-load-misses # 39,53% of all iTLB cache hits
120 254 iTLB-loads
70 332 mem_inst_retired.stlb_miss_loads
8 435 mem_inst_retired.stlb_miss_stores
# High latency use-case
1 086 543 dTLB-load-misses # 0,19% of all dTLB cache hits
566 052 409 dTLB-loads
598 092 dTLB-store-misses
321 672 512 dTLB-stores
834 482 iTLB-load-misses # 443,76% of all iTLB cache hits
188 049 iTLB-loads
986 330 mem_inst_retired.stlb_miss_loads
93 237 mem_inst_retired.stlb_miss_stores
The dTLB is a per-core TLB used for storing the mapping of data pages. The iTLB is a per-core TLB used for storing the mapping of code pages. The sTLB is a bigger second-level TLB. We can see a huge increase in the number of dTLB load misses and iTLB load misses, as well as in the number of loads/stores missing the sTLB. This confirms that the performance issue is likely caused by TLB misses.
TLB misses should result in more cache misses, decreasing performance, and this is what we see in practice. Indeed, here are the performance results for the caches:
# Low latency use-case
463 214 319 mem_load_retired.l1_hit
4 184 769 mem_load_retired.l1_miss
2 527 800 mem_load_retired.l2_hit
1 659 963 mem_load_retired.l2_miss
1 568 506 mem_load_retired.l3_hit
96 549 mem_load_retired.l3_miss
# High latency use-case
558 344 514 mem_load_retired.l1_hit
7 280 721 mem_load_retired.l1_miss
3 564 001 mem_load_retired.l2_hit
3 720 610 mem_load_retired.l2_miss
3 547 260 mem_load_retired.l3_hit
105 502 mem_load_retired.l3_miss
Another part of the overhead is due to conditional jumps being predicted less well after a long sleep. This is a complex topic, but one should know that branches are predicted by mainstream modern processors based on many parameters, including past outcomes. For example, if a condition is always true, the processor can speculatively execute the code after the condition and revert it later if the prediction turns out to be wrong (which is expensive). Modern processors cannot predict a large set of conditional jumps simultaneously: they have a small cache for this, and it can quickly get flushed over time. The thing is, CPython does a lot of conditional jumps, like most interpreters. Thus, a context switch can likely cause a flush of the branch prediction cache, increasing the overhead of the conditional jumps used in this case and resulting in a higher latency.
Here are experimental results on my machine:
# Low latency use-case
350 582 368 branch-instructions
4 629 149 branch-misses # 1,32% of all branches
# High latency use-case
421 207 541 branch-instructions
8 392 124 branch-misses # 1,99% of all branches
Note that a branch miss should take ~14 cycles on my machine. With roughly 3.8 million additional branch misses, this means a gap of ~14 ms, and so ~1400 ns per iteration over the 10_000 iterations. That being said, only a fraction of this time is measured between the two calls to get_time_ns().
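As a back-of-the-envelope check of that estimate (my arithmetic, assuming a ~3.7 GHz clock, i.e. roughly the base frequency with turbo disabled, and the 10_000 iterations of the modified benchmark):

```python
# Rough estimate of the time lost to the additional branch misses.
extra_misses = 8_392_124 - 4_629_149          # ~3.8 million additional misses
cycles = extra_misses * 14                     # ~14 cycles per branch miss
seconds = cycles / 3.7e9                       # assuming a ~3.7 GHz clock
print(round(seconds * 1e3, 1), "ms in total")              # ~14 ms
print(round(seconds * 1e9 / 10_000), "ns per iteration")   # ~1400 ns
```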
For more information about this topic, please read this post.