Tags: python, performance, multiprocessing, queue, ipc

Python multiprocessing Queue: performing get() is a bottleneck


I am creating an event generator in Python.

I have 3 modules (each module is a separate Process); a rough sketch of the wiring is shown after the list:

  • Input module: wakes up at different moments in time and puts the timestamp of that moment into the Input Queue;
  • Event module: gets a timestamp from the Input Queue, renders an event with that timestamp and sends it to the Event Queue;
  • Output module: gets rendered events from the Event Queue and writes them to some endpoint;
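
For context, here is a minimal, self-contained sketch of that wiring. The wake-up logic, rendering, and endpoint are placeholders of my own; only the two-queue, three-process layout matters:

from multiprocessing import Process, Queue
from time import sleep, time

def input_module(input_queue):
    # Wakes up at different moments and puts the timestamp into the Input Queue
    while True:
        sleep(0.001)               # stand-in for the real wake-up logic
        input_queue.put(time())

def event_module(input_queue, event_queue):
    # Gets a timestamp, renders an event with it, sends it to the Event Queue
    while True:
        timestamp = input_queue.get()
        event_queue.put({"timestamp": timestamp})  # stand-in for real rendering

def output_module(event_queue):
    # Gets rendered events and writes them to some endpoint
    while True:
        event = event_queue.get()
        print(event)               # stand-in for the real endpoint

if __name__ == "__main__":
    input_queue, event_queue = Queue(), Queue()
    workers = [
        Process(target=input_module, args=(input_queue,), daemon=True),
        Process(target=event_module, args=(input_queue, event_queue), daemon=True),
        Process(target=output_module, args=(event_queue,), daemon=True),
    ]
    for w in workers:
        w.start()
    sleep(5)                       # let the pipeline run for a few seconds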

The performance problem at high EPS (events per second) rates is in the Event module.

The problem is that the Queue.get() call consumes most of the execution time. The Input Queue is filled in advance, so in the following example it always has items available for get().

Here is some simplified code with performance profiling:


from queue import Empty        # multiprocessing.Queue raises queue.Empty
from time import perf_counter

spent_on_render = 0.0
spent_on_putting = 0.0
spent_on_getting = 0.0
spent_summary = 0.0

events_batch = []
last_flush = perf_counter()

while True:
    start = perf_counter()

    # Flush the accumulated batch once it is big enough or old enough
    if (
        len(events_batch) >= FLUSH_AFTER_SIZE
        or (perf_counter() - last_flush) > FLUSH_AFTER_SECONDS
    ):
        putting_time = perf_counter()

        for event in events_batch:
            event_queue.put(event)

        spent_on_putting += perf_counter() - putting_time

        events_batch.clear()
        last_flush = perf_counter()

    try:
        get_time = perf_counter()
        # block=False makes this a non-blocking get; the timeout
        # argument is ignored in that mode, so it is dropped here
        timestamp = input_queue.get(block=False)
        spent_on_getting += perf_counter() - get_time
    except Empty:
        spent_summary += perf_counter() - start
        continue

    render_time = perf_counter()
    events_batch.append(event_plugin.produce(timestamp=timestamp))
    spent_on_render += perf_counter() - render_time

    spent_summary += perf_counter() - start

After running for some time, the results are:

=================================
Spent on render: 12.241767558232823
Spent on putting: 0.326323675999447
Spent on getting: 22.14875863072848
Spent summary: 35.32571034637658

What confuses me is this: if I just get() from the queue in a loop, with no additional operations, reading the same number of timestamps from the queue takes many times less time.

So what can I do in the described situation to boost performance, or should I use a different IPC mechanism in my case?


UPD:

Here is some extra research. When I first read the whole input queue and only then put everything into the next queue:

timestamps = []
start = perf_counter()
qsize = input_queue.qsize()   # note: qsize() is only approximate for multiprocessing queues
for _ in range(qsize):
    timestamps.append(input_queue.get())   # drain the whole input queue first

print('Size:', qsize)
print('Time spent after get:', perf_counter() - start)

for ts in timestamps:
    event_queue.put(ts)                    # then forward everything in one pass

print('Time spent after put:', perf_counter() - start)

Output is:

Size: 1000000
Time spent after get: 1.935170126002049
Time spent after put: 2.2695002569998906

When I read an element and put it into the next queue in the same iteration:

start = perf_counter()
qsize = input_queue.qsize()
for _ in range(qsize):
    ts = input_queue.get()     # get and put interleaved in the same iteration
    event_queue.put(ts)

print('Size:', qsize)
print('Time spent after get and put:', perf_counter() - start)

Output is:

Size: 1000000
Time spent after get and put: 16.109829995999462

Why is this difference so big?


Solution

  • The problem was related to context switches. I set a greater value for the batch size and got the needed performance.
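
For illustration, here is a minimal sketch of what that change amounts to. The constants are hypothetical; the point is that a larger batch keeps the Event module in its tight get() loop longer between flushes, so the producing and consuming processes switch back and forth less often. A further variation on the same idea (not what the author reports doing, just a related option) is to send the whole batch as a single put() and unpack it on the consumer side:

# Hypothetical values; tune them for your EPS target. The batch size
# just needs to be large enough that flushes (and the context switches
# they trigger) become rare relative to the get()/render work.
FLUSH_AFTER_SIZE = 10_000
FLUSH_AFTER_SECONDS = 1.0

def flush(event_queue, events_batch):
    # Variation: one put() per batch instead of one put() per event,
    # i.e. a single IPC message carrying the whole list. The Output
    # module must then iterate over the list it receives.
    event_queue.put(events_batch.copy())
    events_batch.clear()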