Tags: rust, ipc, shared-memory, mmap, apache-arrow

What's the best practice for exchanging Apache Arrow data between different processes?


I have a data API, written in Rust and running as an independent service process, that receives streaming data. I plan to write several client processes that read this data, because the clients have functions that operate on Apache Arrow data types.

I think this is a single-producer, multi-consumer problem. What's the best practice for exchanging Apache Arrow data between different processes with low latency?


Solution

  • The easiest way to do this is to send the data across a socket, preferably with something like Arrow Flight. That would be my recommendation until you have shown that its performance is insufficient.

    Since the processes are on the same machine, you can potentially use memory-mapped files. On the producer, first create a memory-mapped file and then write the data into it. You need to know the file size up front; I'm not sure exactly how best to compute it, but the sizes of the data buffers are easy to find, and you can add a conservative guess for the metadata. Make sure to write the data in the Arrow format with no compression. This involves one memory copy from user space to kernel space. You would then send the filename to the consumers over a socket.

    Then, on the consumers, you can open the file as a memory-mapped file and work with the data (presumably read-only, since there are multiple consumers). This read will be zero-copy, since the file should still be in the kernel's page cache. If the kernel is swapping, however, this approach is likely to backfire.
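
To make the handoff concrete, here is a minimal sketch of only the "write the file, then send the filename over a socket" part, using nothing but the Rust standard library. Everything in it is an assumption for illustration: the function names `publish` and `consume` are made up, the payload is an arbitrary byte slice standing in for an uncompressed Arrow IPC file, and `fs::read` is a placeholder for the memory-mapping step (which needs a crate outside std and would avoid the copy into a `Vec`).

```rust
use std::fs;
use std::io::{self, BufRead, BufReader, Write};
use std::net::{SocketAddr, TcpListener, TcpStream};
use std::path::Path;

/// Producer side (hypothetical helper): persist the payload, then tell each
/// connecting consumer where to find it.
/// `payload` stands in for an uncompressed Arrow IPC file (assumption).
fn publish(
    payload: &[u8],
    path: &Path,
    listener: &TcpListener,
    consumers: usize,
) -> io::Result<()> {
    // Finish writing before announcing, so no consumer sees a partial file.
    fs::write(path, payload)?;
    for _ in 0..consumers {
        let (mut stream, _peer) = listener.accept()?;
        // One line per message: the path of the file to open.
        // Note: `display()` is lossy for paths that are not valid UTF-8.
        writeln!(stream, "{}", path.display())?;
    }
    Ok(())
}

/// Consumer side (hypothetical helper): receive the path over the socket,
/// then open the file read-only.
/// A real consumer would memory-map the file at this point; `fs::read`
/// copies the bytes and is only a std-only placeholder.
fn consume(addr: SocketAddr) -> io::Result<Vec<u8>> {
    let stream = TcpStream::connect(addr)?;
    let mut line = String::new();
    BufReader::new(stream).read_line(&mut line)?;
    fs::read(line.trim_end_matches('\n'))
}
```

To try it, bind a `TcpListener` on the producer, run `publish` on its own thread, and call `consume` once per client with the listener's address. The one ordering rule that matters is that the file is fully written before its name goes out on the socket; a TCP loopback socket is used here only for portability, and a Unix domain socket would serve equally well on the same machine.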