
How many copies are needed for a ZeroMQ IPC request?


I am investigating how many copies happen during an IPC request, to help me choose the best IPC solution for my mobile device.

For example, shared memory is zero-copy, since processes can directly exchange messages.

Most IPC methods used on Linux (e.g., sockets) need two copies: user -> kernel -> user.

Binder on Android needs only one copy, from the sender's user space into kernel space; the receiver then uses mmap() to avoid the second copy.

As far as I know, ZeroMQ's ipc:// transport is based on UNIX domain sockets. Does that mean copying is unavoidable? If so, how many copies are needed?
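
For reference, this is the kind of round trip I am asking about. A minimal sketch (the endpoint path /tmp/example.ipc is arbitrary):

```c
// Minimal ZeroMQ ipc:// round trip; the endpoint path is arbitrary.
// Build: gcc ipc_demo.c -lzmq
#include <zmq.h>
#include <stdio.h>

int main(void)
{
    void *ctx = zmq_ctx_new();

    // On Linux, ZeroMQ's ipc:// transport is carried by a UNIX domain socket.
    void *rep = zmq_socket(ctx, ZMQ_REP);
    zmq_bind(rep, "ipc:///tmp/example.ipc");

    void *req = zmq_socket(ctx, ZMQ_REQ);
    zmq_connect(req, "ipc:///tmp/example.ipc");

    char buf[16];
    zmq_send(req, "ping", 4, 0);                // the request leaves user space here
    int n = zmq_recv(rep, buf, sizeof buf, 0);  // ...and re-enters it here
    printf("received %.*s\n", n, buf);

    zmq_close(req);
    zmq_close(rep);
    zmq_ctx_destroy(ctx);
    return 0;
}
```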


Solution

  • It might be 4 or 5:

    1. You write data into a zmq message
    2. The zmq message is copied by ZMQ on calling zmq_send()
    3. Copied again into the kernel, via the UNIX domain socket (or pipe) that carries the ipc:// transport
    4. Copied again out of the kernel into a zmq message on the receiving side
    5. Optionally copied again out of the zmq message into whatever object the message data represents

    Step 5 is omitted if you consume the data directly, e.g. if it's a string you can use in place.
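
    For what it's worth, the ZeroMQ API also lets you shave off copies 2 and 5: zmq_msg_init_data() hands ZMQ a buffer you allocated (with a deallocation callback) instead of copying it, and zmq_msg_data() on the receiving side gives you a pointer into the message rather than forcing a copy out. A minimal sketch (the PUSH/PULL pair and endpoint name are illustrative; the kernel-side copies 3 and 4 still happen):

    ```c
    // Sketch: trimming copies 2 and 5 at the ZeroMQ API boundary.
    // The kernel-side copies (3 and 4) still happen on an ipc:// transport.
    // Build: gcc zc_demo.c -lzmq
    #include <zmq.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    // Called by ZeroMQ once it has finished with the buffer we lent it.
    static void free_fn(void *data, void *hint)
    {
        (void)hint;
        free(data);
    }

    int main(void)
    {
        void *ctx  = zmq_ctx_new();
        void *pull = zmq_socket(ctx, ZMQ_PULL);
        void *push = zmq_socket(ctx, ZMQ_PUSH);
        zmq_bind(pull, "ipc:///tmp/example.ipc");    // arbitrary endpoint name
        zmq_connect(push, "ipc:///tmp/example.ipc");

        // Copy 1: we write the data into a buffer we own.
        size_t len = 5;
        char *payload = malloc(len);
        memcpy(payload, "hello", len);

        // zmq_msg_init_data() adopts our buffer instead of copying it,
        // removing copy 2; free_fn() is invoked when ZeroMQ is done with it.
        zmq_msg_t out;
        zmq_msg_init_data(&out, payload, len, free_fn, NULL);
        zmq_msg_send(&out, push, 0);

        // zmq_msg_data() reads the payload in place, omitting copy 5.
        zmq_msg_t in;
        zmq_msg_init(&in);
        zmq_msg_recv(&in, pull, 0);
        printf("got %.*s\n", (int)zmq_msg_size(&in), (char *)zmq_msg_data(&in));
        zmq_msg_close(&in);

        zmq_close(push);
        zmq_close(pull);
        zmq_ctx_destroy(ctx);
        return 0;
    }
    ```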

    The Microelectronics in the CPU

    However, it's worth pausing for a moment to consider what a "copy" is, and what's actually going on down at the microelectronics level with shared memory.

    The model we have of shared memory is that data is stored at some address, and all cores can equally access that data at that address. However, that's not strictly the case. The data has to be copied into a core's L1 cache before it can be processed. So the overall transaction in the microelectronics could be:

    1. Copy data from L1 to L2, L3, main memory
    2. Copy data across some inter-core network (which will involve its own buffers)
    3. Copy data into the L1 of the core that "addressed" the shared memory.

    Something like this will happen on pretty much any modern CPU, mobile or desktop.

    So as you can see, there's actually quite a lot of copying going on to access shared memory, even though the programming language's memory model disguises it. Note that much of the same traffic occurs when the application software explicitly copies the data.

    Throw in the fact that the inter-core network is kept busy with cache coherency traffic for shared memory, and it makes for a busy chip.

    Note that this cache coherency traffic is absent in the ZMQ Actor model, because each process has its own separate copy of the data and no other process is accessing or caching it.
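
    A rough way to observe that coherency traffic: two threads incrementing counters that sit in the same cache line force the line to ping-pong between cores, whereas counters padded onto separate lines do not. A minimal sketch, assuming a 64-byte cache line (typical of current CPUs):

    ```c
    // Sketch: cache-line "ping-pong" between two cores.
    // Two threads bump adjacent counters; when the counters share a cache
    // line, coherency traffic makes the loop markedly slower than when the
    // counters are padded onto separate lines.
    // Build: gcc -O2 -pthread false_share.c
    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    #define ITERS 100000000UL
    #define LINE  64                       /* assumed cache-line size */

    static _Alignas(LINE) struct {
        volatile unsigned long a, b;       /* same cache line */
    } together;

    static _Alignas(LINE) struct {
        volatile unsigned long a;
        char pad[LINE];                    /* forces b onto another line */
        volatile unsigned long b;
    } apart;

    static void *bump(void *p)
    {
        volatile unsigned long *c = p;
        for (unsigned long i = 0; i < ITERS; i++)
            (*c)++;
        return NULL;
    }

    static double run(volatile unsigned long *x, volatile unsigned long *y)
    {
        struct timespec t0, t1;
        pthread_t ta, tb;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        pthread_create(&ta, NULL, bump, (void *)x);
        pthread_create(&tb, NULL, bump, (void *)y);
        pthread_join(ta, NULL);
        pthread_join(tb, NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }

    int main(void)
    {
        printf("same cache line: %.2f s\n", run(&together.a, &together.b));
        printf("separate lines:  %.2f s\n", run(&apart.a, &apart.b));
        return 0;
    }
    ```

    On a multicore machine the "same cache line" run is typically several times slower; that slowdown is exactly the traffic the Actor model sidesteps.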

    It's also complicated if the destination thread gets scheduled on the same core as the origin thread, because in that case the data may well still be in that core's L1 cache.

    So what actually happens is partly a matter of "luck", and partly down to application and OS design.
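
    If you want to take the scheduling "luck" out of it, you can pin threads to cores yourself. A minimal sketch using the Linux-specific pthread_setaffinity_np() (the core numbers are illustrative, assuming a machine with at least two cores):

    ```c
    // Sketch: pinning threads so "same core, warm L1" is a design decision
    // rather than scheduler luck. Linux-specific (GNU extension).
    // Build: gcc -pthread pin_demo.c
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    // Pin the calling thread to a single core.
    static void pin_to_core(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        int rc = pthread_setaffinity_np(pthread_self(), sizeof set, &set);
        if (rc != 0)
            fprintf(stderr, "pthread_setaffinity_np failed: %d\n", rc);
    }

    static void *receiver(void *arg)
    {
        pin_to_core(*(int *)arg);
        /* ...receive and process messages here... */
        return NULL;
    }

    int main(void)
    {
        pin_to_core(0);     // sender on core 0
        int core = 0;       // 0: share the sender's L1; 1: force a separate core
        pthread_t t;
        pthread_create(&t, NULL, receiver, &core);
        pthread_join(t, NULL);
        return 0;
    }
    ```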