I have a requirement wherein I have to buffer a large amount of data (several GB) for future use. Since there isn't enough RAM available for buffering such a huge amount of data, I decided to store the data in a file.
Now the pitfall here is that while I am writing the data to the file, other threads might need that "buffered" data, so I have to flush the file stream every time I write something to it. To be precise, the data is video frames that I buffer as pre-recorded data (like a TiVo), and other threads may or may not want to read it at any given point in time; when they do, they fread from the file and process the frames.
In the general case, the fwrite-fflush combo takes around 150 µs, but occasionally (and fairly regularly) it takes more than 1.5 seconds. I can't afford this as I have to process frames in real-time.
I have many questions here:
1. Is my approach of buffering the data in a file correct? What alternatives do I have?
2. Any idea why the fwrite-fflush combination suddenly takes so much longer on some occasions? Note that it goes back to around 150 µs right after a 1.5-second spike.
As for #2: most modern file systems use a B-tree-like structure to manage the large number of directory and data nodes on today's huge hard disks. As with all B-trees, they need to be rebalanced from time to time. While that happens, no changes can be made, which is why the write stalls. Usually this isn't a big deal because of the OS's large caches, but yours is a corner case where it hurts.
What can you do about it? There are two approaches:
1. Use sockets to communicate and keep the last N frames in RAM (i.e. never write them to disk, or use an independent process to write them to disk).
2. Don't write a new file; overwrite an existing file. Since the location of all data blocks is known in advance, there will be no reorganization in the FS while you write. It will also be a little bit faster. So the idea is to create a huge file (or use a raw partition) and then overwrite it. When you hit the end of the file, seek back to the start and repeat (a minimal sketch follows this list).
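Here is a minimal sketch of #2, assuming a fixed frame size and a POSIX system (fseeko); the sizes are assumptions you'd adjust to your format:

```c
#include <stdio.h>
#include <stdint.h>
#include <sys/types.h>

/* Sketch of approach #2: pre-allocate one big file and overwrite it in a
   circle. FRAME_SIZE and MAX_FRAMES are assumptions; fseeko is POSIX. */
#define FRAME_SIZE  (640 * 480 * 2)
#define MAX_FRAMES  3000                 /* ~100 s at 30 fps; size to taste */

typedef struct {
    FILE    *fp;
    uint64_t next;                       /* index of the next slot to overwrite */
} frame_ring;

/* Create the file at full size up front so the FS never has to grow it later. */
int ring_create(frame_ring *r, const char *path)
{
    static const unsigned char zero[FRAME_SIZE];
    r->fp = fopen(path, "w+b");
    if (!r->fp) return -1;
    for (int i = 0; i < MAX_FRAMES; i++)
        if (fwrite(zero, 1, FRAME_SIZE, r->fp) != FRAME_SIZE) return -1;
    fflush(r->fp);
    r->next = 0;
    return 0;
}

/* Overwrite the next slot in place; wrap around instead of extending the file. */
int ring_write(frame_ring *r, const unsigned char *frame)
{
    uint64_t slot = r->next % MAX_FRAMES;
    if (fseeko(r->fp, (off_t)(slot * FRAME_SIZE), SEEK_SET) != 0) return -1;
    if (fwrite(frame, 1, FRAME_SIZE, r->fp) != FRAME_SIZE) return -1;
    fflush(r->fp);                       /* make the frame visible to readers */
    r->next++;
    return 0;
}
```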
Drawbacks:
With approach #1, you can lose frames. Also, you must make absolutely sure that all clients can read and process the data fast enough or the server might block.
With #2, you must find a way to tell the readers where the current "end of file" is (one simple scheme is sketched below).
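One possible scheme (an assumption on my part, not the only option): reserve a small header at the start of the file that the writer updates with the index of the last frame written, and have readers re-read it before they fread:

```c
#include <stdio.h>
#include <stdint.h>
#include <sys/types.h>

/* One way to publish the writer's position: a small header at offset 0
   holding the index of the last frame written (frame data would start at
   offset HEADER_SIZE). The layout is an assumption; shared memory or a
   socket message would work just as well. */
#define HEADER_SIZE 16

/* Writer side: update the header after each frame. */
int publish_write_index(FILE *fp, uint64_t last_frame)
{
    if (fseeko(fp, 0, SEEK_SET) != 0) return -1;
    if (fwrite(&last_frame, sizeof last_frame, 1, fp) != 1) return -1;
    fflush(fp);
    return 0;
}

/* Reader side (each reader opens its own FILE* on the same path): check how
   far the writer has gotten before deciding which frames to fread. */
int read_write_index(FILE *fp, uint64_t *last_frame)
{
    if (fseeko(fp, 0, SEEK_SET) != 0) return -1;
    if (fread(last_frame, sizeof *last_frame, 1, fp) != 1) return -1;
    return 0;
}
```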
So maybe a mixed approach is best: write to a pre-allocated file as in #2, but also keep the most recent frames in RAM as in #1 so the readers rarely have to touch the disk.
Also consider using memory-mapped files; they will make everything a bit simpler.
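A minimal sketch of the memory-mapped variant, assuming POSIX mmap and the same pre-sized file (the file name and sizes are placeholders):

```c
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Sketch of the memory-mapped variant: map one pre-sized file and copy
   frames straight into it; readers mapping the same file see the data
   without any explicit fwrite/fflush. Sizes and the file name are
   assumptions. POSIX only. */
#define FRAME_SIZE  (640 * 480 * 2)
#define MAX_FRAMES  3000

int main(void)
{
    size_t size = (size_t)FRAME_SIZE * MAX_FRAMES;

    int fd = open("prerecord.buf", O_RDWR | O_CREAT, 0644);
    if (fd < 0 || ftruncate(fd, (off_t)size) != 0)   /* pre-size the file */
        return 1;

    unsigned char *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                              MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED)
        return 1;

    static unsigned char frame[FRAME_SIZE];          /* stand-in for a real frame */
    uint64_t next = 0;

    /* Writer loop body: copy the frame into its slot and move on. */
    memcpy(buf + (next % MAX_FRAMES) * FRAME_SIZE, frame, FRAME_SIZE);
    next++;

    munmap(buf, size);
    close(fd);
    return 0;
}
```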