c++ · file-io · streaming · posix · disk-io

High-performance ways to stream local files to the network as they are being written


Today a system exists that writes packet-capture files to the local disk as the packets come in. Dropping these files to local disk as the first step is considered desirable for fault-tolerance reasons: if a client dies and needs to reconnect or be brought up somewhere else, we retain the ability to replay from disk.

The next step in the data pipeline is getting this data that was landed to disk out to remote clients. Assuming sufficient disk space, it strikes me as very convenient to use the local disk (and the page cache on top of it) as a persistent, unbounded FIFO. It is also desirable to use the file system to keep the coupling between the producer and consumer low.

In my research, I have not found much guidance on this type of architecture. More specifically, I have not seen well-established patterns in popular open-source libraries/frameworks for reading a file while it is still being written, in order to stream it out.

My questions:

  1. Is there a flaw in this architecture that I am not seeing or am inadvertently downplaying?

  2. Are there recommendations for consuming a file while it is being written, and for efficiently blocking and/or being notified asynchronously when more data becomes available in the file?

  3. A goal is for the consumer to benefit, explicitly or implicitly, from page-cache warmth. Are there any recommendations on how to optimize for this?


Solution

  • The file-based solution sounds clunky but could work, similar to how tail -f does it:

    • read the file until EOF, but do not close it
    • set up an inode watch (with inotify) and wait for more writes
    • repeat

    The difficulty is usually with file rotation and cleanup, i.e. you also need to watch for new files and/or truncation (a sketch of the basic loop follows below).
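
    For illustration, here is a minimal sketch of that follow loop in C++ using POSIX inotify. The file name capture.pcap and the consume() helper are placeholders, and rotation handling is only stubbed out; treat it as a sketch of the pattern rather than a drop-in implementation.

    ```cpp
    // Minimal sketch of a tail -f style follow loop using POSIX inotify.
    // Assumptions: "capture.pcap" and consume() are placeholders; error
    // handling and file-rotation handling are reduced to keep it short.
    #include <sys/inotify.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>
    #include <vector>

    // Placeholder: forward the newly appended bytes to the remote client.
    static void consume(const char* data, ssize_t len) {
        fwrite(data, 1, static_cast<size_t>(len), stdout);
    }

    int main() {
        const char* path = "capture.pcap";  // hypothetical file being written by the producer
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        int ifd = inotify_init1(IN_CLOEXEC);
        if (ifd < 0) { perror("inotify_init1"); return 1; }
        // Watch for appends (IN_MODIFY) and for rotation/cleanup of the file itself.
        if (inotify_add_watch(ifd, path, IN_MODIFY | IN_MOVE_SELF | IN_DELETE_SELF) < 0) {
            perror("inotify_add_watch");
            return 1;
        }

        std::vector<char> buf(64 * 1024);
        for (;;) {
            // 1. Drain everything currently in the file; these reads are served
            //    from the page cache while the writer is still hot.
            ssize_t n;
            while ((n = read(fd, buf.data(), buf.size())) > 0)
                consume(buf.data(), n);
            if (n < 0) { perror("read"); break; }

            // 2. At EOF, block until inotify reports more writes or rotation.
            alignas(inotify_event) char events[4096];
            ssize_t elen = read(ifd, events, sizeof(events));
            if (elen <= 0) { perror("read inotify"); break; }
            for (char* p = events; p < events + elen;) {
                auto* ev = reinterpret_cast<inotify_event*>(p);
                if (ev->mask & (IN_MOVE_SELF | IN_DELETE_SELF)) {
                    // The producer rotated or removed the file: reopen the
                    // new file here (omitted) instead of exiting.
                    fprintf(stderr, "file rotated or deleted\n");
                    close(fd);
                    close(ifd);
                    return 0;
                }
                p += sizeof(inotify_event) + ev->len;
            }
            // IN_MODIFY was seen: loop back and read the newly appended data.
        }
        close(fd);
        close(ifd);
        return 0;
    }
    ```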

    Having said that, it might be more efficient to connect to the packet-capture interface directly, or to set up a queue to which clients can subscribe.