Tags: cpython-3.x, sockets, zeromq, 32bit-64bit

Latency after moving a micro-service (using ZeroMQ, C, & Python processes) from 64 bit hardware to 32 bit hardware, but nominal CPU usage


I have two processes written in C that set up PUSH/PULL ZeroMQ sockets and two threads in a Python process that mirror the PUSH/PULL sockets. There are roughly 80 - 300 lightweight (<30 bytes) messages per second being sent from the C process to the Python process, and 10 - 30 similar messages from the Python process to the C process.

I was running these services on 64 bit ARMv8 (Ubuntu-based) and AMD64 (Ubuntu 18.04) with no noticeable latency. I tried running the exact same services on a 32 bit Linux-based system and was shocked to see messages coming through over 30 seconds behind, even after killing the C services. When checking the CPU usage, it was pretty flat at 30-40% and didn't appear to be the bottleneck.

My ZeroMQ socket settings didn't change between systems: I set LINGER to 0, tried RCVTIMEO values between 0 and 100 ms, and varied BACKLOG between 0 and 50, with no difference either way. I also tried using multiple IO threads and setting socket thread affinity, to no avail. For the PUSH sockets I'm connecting on tcp://localhost:##### and binding the PULL sockets to tcp://*:#####. I also tried ipc:///tmp/... endpoints; messages were sent and received, but the latency still existed on the 32 bit system.
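
For illustration, the multi-I/O-thread and thread-affinity experiments looked roughly like the following (a simplified sketch, not my exact code; the AFFINITY bitmask value is just an example):

    import zmq

    context = zmq.Context(io_threads=4)     # several I/O threads, equivalent to zmq.Context(4)

    sink = context.socket(zmq.PULL)
    sink.setsockopt(zmq.AFFINITY, 1)        # pin this socket's traffic to I/O thread #0 (bitmask)
    sink.setsockopt(zmq.LINGER, 0)
    sink.setsockopt(zmq.RCVTIMEO, 100)      # tried values from 0 to 100 ms
    sink.setsockopt(zmq.BACKLOG, 0)         # tried values from 0 to 50

    sink.bind("tcp://*:55755")              # also tried ipc:///tmp/... endpoints instead of tcp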

I investigated the other Python steps in between receiving the messages, and they don't appear to take more than a millisecond at most. When I time the socket.recv(0) call it is as high as 0.02 seconds, even when RCVTIMEO is set to 0 for that socket.
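
For what it's worth, the per-call timing was gathered with something along these lines (a minimal sketch using time.perf_counter(), not the exact instrumentation):

    import time
    import zmq

    def timed_recv(sink):
        """Receive one message and report how long the recv() call itself took."""
        t0 = time.perf_counter()
        try:
            msg = sink.recv(0)                  # same call as in _recv() shown below
        except zmq.Again:                       # RCVTIMEO expired without a message
            msg = None
        print("recv took %.6f s" % (time.perf_counter() - t0))   # up to ~0.02 s on the 32 bit box
        return msg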

Any suggestions why I would see this behaviour on the new 32 bit platform and not on other platforms? Am I possibly looking in all the wrong places?

Here's a bit of code to help explain:

The connection and the _recv() class-method are roughly depicted below:

    def _connect(self):
        self.context = zmq.Context(4)
        self.sink = self.context.socket(zmq.PULL)
        self.sink.setsockopt(zmq.LINGER, 0)
        self.sink.setsockopt(zmq.RCVTIMEO, 100)
        self.sink.setsockopt(zmq.BACKLOG, 0)
        self.sink.bind("tcp://*:55755")

    def _recv(self):
        while True:
            msg = None
            try:
                msg = self.sink.recv(0)  # Use blocking or zmq.NOBLOCK, still appears to be slow
            except zmq.ZMQError:
                ...  # meaningful exception handling here

            # This last step, when timed usually takes less than a millisecond to process
            if msg:
                msg_dict = utils.bytestream_to_dict(msg)  # unpacking step (negligible)
                if msg_dict:
                    self.parser.parse(msg_dict)  # parser is a dict of callbacks also negligible

On the C process side

    void *context = zmq_ctx_new ();
    zmq_ctx_set (context, ZMQ_IO_THREADS, 4);   /* use 4 I/O threads in this context */

    /* Connect the Sender */
    void *vent = zmq_socket (context, ZMQ_PUSH);

    int timeo = 0;
    int timeo_ret = zmq_setsockopt(vent, ZMQ_SNDTIMEO, &timeo, sizeof(timeo));
    if (timeo_ret != 0)
        error("Failed to set ZMQ recv timeout because %s", zmq_strerror(errno));

    int linger = 100;
    int linger_ret = zmq_setsockopt(vent, ZMQ_LINGER, &linger, sizeof(linger));
    if (linger_ret != 0)
        error("Failed to set ZMQ linger because %s", zmq_strerror(errno));

    if (zmq_connect (vent, vent_port) == 0)
        info("Successfully initialized ZeroMQ ventilator on %s", vent_port);
    else {
        error("Failed to initialize %s ZeroMQ ventilator with error %s", sink_port, 
               zmq_strerror(errno));
        ret = 1;
    }

    ...

    /* When a message needs to be sent it's instantly hitting this where msg is a char* */
    ret = zmq_send(vent, msg, msg_len, ZMQ_NOBLOCK);

On Docker running on the target 32 bit system, lstopo - -v --no-io reports:

    Machine (P#0 local=1019216KB total=1019216KB HardwareName="Freescale i.MX6 Quad/DualLite (Device Tree)" HardwareRevision=0000 HardwareSerial=0000000000000000 Backend=Linux LinuxCgroup=/docker/d2b0a3b3a5eedb7e10fc89fdee6e8493716a359597ac61350801cc302d79b8c0 OSName=Linux OSRelease=3.10.54-dey+g441c8d4 OSVersion="#1 SMP PREEMPT RT Tue Jan 28 12:11:37 CST 2020" HostName=db1docker Architecture=armv7l hwlocVersion=1.11.12 ProcessName=lstopo)
      Package L#0 (P#0 CPUModel="ARMv7 Processor rev 10 (v7l)" CPUImplementer=0x41 CPUArchitecture=7 CPUVariant=0x2 CPUPart=0xc09 CPURevision=10)
        Core L#0 (P#0)
          PU L#0 (P#0)
        Core L#1 (P#1)
          PU L#1 (P#1)
        Core L#2 (P#2)
          PU L#2 (P#2)
        Core L#3 (P#3)
          PU L#3 (P#3)
    depth 0:        1 Machine (type #1)
     depth 1:       1 Package (type #3)
      depth 2:      4 Core (type #5)
       depth 3:     4 PU (type #6)

( graphical lstopo view of the same topology omitted )

EDIT:

We were able to make the latency disappear on our target machine by disabling nearly all other worker threads.


Solution

  • Q : roughly 80 - 300 lightweight (<30 bytes) messages per second being sent from the C process to the Python process, and 10-30 similar messages from the Python process to the C process.

    a ) there is zero information about sending any messages from Python to C ( nothing about this is contained in the posted source code; only the C side PUSH-es to Python )

    b ) 300 [Hz] of < 30 B payloads is nothing in terms of ZeroMQ capabilities

    c ) Python execution is, and will almost surely remain, pure-[SERIAL], no matter how many Thread-instances are spawned: each thread has to wait until it gets POSACK'ed the GIL-lock ownership, blocking any other work, which re-instates a pure-[SERIAL] one-step-after-another flow of work, yet at the additional cost of the GIL-lock handshaking.

    d ) given all processes run on the same hardware platform ( see the tcp://localhost... specified ), there is no reason to spawn as many as ( 4 + 4 ) IO-threads, which Python cannot "harness" anyway, since it reads with just one single thread at a time (slo-mo), and no more than 4 CPU-cores were reported by the lstopo excerpt above ( a single-IO-thread sketch follows after these notes ):

    Machine (995MB)
    +Package L#0
     Core L#0 +PU L#0 (P#0)
     Core L#1 +PU L#1 (P#1)
     Core L#2 +PU L#2 (P#2)
     Core L#3 +PU L#3 (P#3)
    

    e ) ISO-OSI-L2/L3 parameters make sense to tweak, but only after the larger sources of latency have been shaved off.

    f ) last but not least, run the python pystone test ( on both the original platform and the target hardware platform ), to see the actual relative performance of the i.MX6-CPU-powered Python and become able to compare apples to apples

    Running pystone on the target machine reports This machine benchmarks at 10188.5 pystones/second, while the host machine reports 274264 pystones/second.
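
    For reference, such a pystone run can be reproduced roughly like this (a sketch; older CPython releases bundle pystone as test.pystone, newer ones need a standalone pystone.py copy on the path):

        from test import pystone                  # stdlib location in older CPython releases

        benchtime, stones = pystone.pystones(loops=pystone.LOOPS)
        print("Pystone(%s) time for %d passes = %g" % (pystone.__version__, pystone.LOOPS, benchtime))
        print("This machine benchmarks at %g pystones/second" % stones)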

    So the problem with the deployment onto the i.MX6 target is not just its 32-bit O/S per se, but also the ~27x slower Python processing, where over-subscribing IO-threads ( 4 + 4 threads on just 4 CPU-cores ) does not improve the flow of messages.
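
    A low-effort step in that direction (a sketch only, re-using the pyzmq calls already shown in the question, with print() standing in for the real parser): keep the Context down at a single I/O thread and let one dedicated Python thread drain the PULL socket, so that neither spare I/O threads nor extra GIL-bound worker threads compete for the 4 slow cores:

        import threading
        import zmq

        context = zmq.Context(io_threads=1)   # one I/O thread is plenty for ~300 msg/s of < 30 B payloads

        sink = context.socket(zmq.PULL)
        sink.setsockopt(zmq.LINGER, 0)
        sink.setsockopt(zmq.RCVTIMEO, 100)
        sink.bind("tcp://*:55755")

        def drain(sock, handle):
            # a single consumer: the GIL serialises the Python work anyway,
            # so more receiving threads only add contention on a slow CPU
            while True:
                try:
                    handle(sock.recv(0))
                except zmq.Again:              # RCVTIMEO expired, nothing arrived yet
                    continue

        threading.Thread(target=drain, args=(sink, print), daemon=True).start()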


    A better view, served by lstopo-no-graphics -.ascii

        ┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
        │ Machine (31876MB)                                                                                                 │
        │                                                                                                                   │
        │ ┌────────────────────────────────────────────────────────────┐                      ┌───────────────────────────┐ │
        │ │ Package P#0                                                │  ├┤╶─┬─────┼┤╶───────┤ PCI 10ae:1F44             │ │
        │ │                                                            │      │               │                           │ │
        │ │ ┌────────────────────────────────────────────────────────┐ │      │               │ ┌────────────┐  ┌───────┐ │ │
        │ │ │ L3 (8192KB)                                            │ │      │               │ │ renderD128 │  │ card0 │ │ │
        │ │ └────────────────────────────────────────────────────────┘ │      │               │ └────────────┘  └───────┘ │ │
        │ │                                                            │      │               │                           │ │
        │ │ ┌──────────────────────────┐  ┌──────────────────────────┐ │      │               │ ┌────────────┐            │ │
        │ │ │ L2 (2048KB)              │  │ L2 (2048KB)              │ │      │               │ │ controlD64 │            │ │
        │ │ └──────────────────────────┘  └──────────────────────────┘ │      │               │ └────────────┘            │ │
        │ │                                                            │      │               └───────────────────────────┘ │
        │ │ ┌──────────────────────────┐  ┌──────────────────────────┐ │      │                                             │
        │ │ │ L1i (64KB)               │  │ L1i (64KB)               │ │      │               ┌───────────────┐             │
        │ │ └──────────────────────────┘  └──────────────────────────┘ │      ├─────┼┤╶───────┤ PCI 10bc:8268 │             │
        │ │                                                            │      │               │               │             │
        │ │ ┌────────────┐┌────────────┐  ┌────────────┐┌────────────┐ │      │               │ ┌────────┐    │             │
        │ │ │ L1d (16KB) ││ L1d (16KB) │  │ L1d (16KB) ││ L1d (16KB) │ │      │               │ │ enp2s0 │    │             │
        │ │ └────────────┘└────────────┘  └────────────┘└────────────┘ │      │               │ └────────┘    │             │
        │ │                                                            │      │               └───────────────┘             │
        │ │ ┌────────────┐┌────────────┐  ┌────────────┐┌────────────┐ │      │                                             │
        │ │ │ Core P#0   ││ Core P#1   │  │ Core P#2   ││ Core P#3   │ │      │     ┌──────────────────┐                    │
        │ │ │            ││            │  │            ││            │ │      ├─────┤ PCI 1002:4790    │                    │
        │ │ │ ┌────────┐ ││ ┌────────┐ │  │ ┌────────┐ ││ ┌────────┐ │ │      │     │                  │                    │
        │ │ │ │ PU P#0 │ ││ │ PU P#1 │ │  │ │ PU P#2 │ ││ │ PU P#3 │ │ │      │     │ ┌─────┐  ┌─────┐ │                    │
        │ │ │ └────────┘ ││ └────────┘ │  │ └────────┘ ││ └────────┘ │ │      │     │ │ sr0 │  │ sda │ │                    │
        │ │ └────────────┘└────────────┘  └────────────┘└────────────┘ │      │     │ └─────┘  └─────┘ │                    │
        │ └────────────────────────────────────────────────────────────┘      │     └──────────────────┘                    │
        │                                                                     │                                             │
        │                                                                     │     ┌───────────────┐                       │
        │                                                                     └─────┤ PCI 1002:479c │                       │
        │                                                                           └───────────────┘                       │
        └───────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘