
Multiple buffers on the same file


The procedure is as follows.

  1. Filter a huge File.txt file (FASTQ format, if you are interested) line by line using file streaming in C.

  2. After each filtering process, the output is a filtered_i.txt file.

  3. Repeat steps 1-2 with 1000 different filters.

  4. The expected results are 1000 filtered_i.txt files, i from 1 to 1000.

The question is:

Can I run these filtering processes in parallel?

My concern is that running them in parallel would open multiple buffers on File.txt. Is that safe? Are there any potential drawbacks?


Solution

  • There is no single best answer to your problem. Here are some potential issues to take into consideration:

    • opening the same file multiple times for reading in the same or multiple processes does not pose any problems per se, but you might run out of file handles either at the process level or at the system level.
    • if the filters use a lot of RAM, running too many of them in parallel may cause swapping, which will significantly slow down the whole process.
    • if the file is large but fits in memory, it is likely to stay in the cache, so running filters in sequence would not cause I/O delays, but running them in parallel may take advantage of multiple cores.
    • conversely, if the file does not fit in memory, running filters in parallel should increase overall throughput, especially if they consume the same area of the file at the same time, because a block read from disk for one filter can then be served to the others from the cache.
    • if the process is I/O bound and filters can consume one line at a time, a simple solution is a single process that reads the file one line at a time and calls each filter as a function in a loop. Running several such processes in parallel, each handling a subset of the filters, can further improve throughput.
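    The last point can be sketched in C as follows. This is only a minimal illustration under stated assumptions: the `line_filter` signature, the two toy filters and the name `filter_single_pass` are hypothetical, and real FASTQ filters would probably work on whole 4-line records rather than on single lines. It also keeps one output stream open per filter, so the file-handle limit mentioned above still applies.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical filter signature: return nonzero to keep the line. */
typedef int (*line_filter)(const char *line);

/* Two toy filters, for illustration only. */
static int keep_headers(const char *line) { return line[0] == '@'; }
static int keep_with_N(const char *line)  { return strchr(line, 'N') != NULL; }

/* Read `in` once and offer every line to each filter in turn.
 * A line accepted by filters[i] is written to outs[i].
 * Returns the number of fgets() chunks read, or -1 on an I/O error.
 * Assumption: lines fit in 4095 bytes; longer lines are split into
 * chunks and each chunk is filtered separately. */
static long filter_single_pass(FILE *in, const line_filter *filters,
                               FILE *const *outs, size_t n_filters)
{
    char line[4096];
    long n_lines = 0;

    assert(in != NULL && filters != NULL && outs != NULL);

    while (fgets(line, sizeof line, in) != NULL) {
        n_lines++;
        for (size_t i = 0; i < n_filters; i++) {
            if (filters[i](line) && fputs(line, outs[i]) == EOF)
                return -1;  /* write error on output i */
        }
    }
    return ferror(in) ? -1 : n_lines;
}
```

    A caller would open File.txt once, open one output stream per filter, and call `filter_single_pass(in, filters, outs, n)`. The input is then read from disk a single time no matter how many filters are applied.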

    As with all optimisation problems, you should test different approaches and measure performance.

    Here is a simple script to run 20 filters in parallel:

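    #!/bin/bash
    # 20 background workers; worker i runs filters i+1, i+21, ..., i+981 in sequence
    for i in {0..19}; do (for j in {0..49}; do ./filter_$((j*20+i+1)); done) & done
    wait  # block until every worker has finished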
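    Returning to the original concern about several buffers on the same file: the small C sketch below opens one file twice for reading and checks that each stream keeps its own buffer and its own position. The function name `two_readers_agree` is hypothetical and the code is only meant as a quick experiment, not as part of any filter.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Open `path` twice for reading and read the first line through each
 * stream. Returns 1 if both streams deliver the same first line,
 * 0 if they differ, and -1 on error. */
static int two_readers_agree(const char *path)
{
    char a[256], b[256];
    int result = -1;
    FILE *r1, *r2;

    assert(path != NULL);
    r1 = fopen(path, "r");
    r2 = fopen(path, "r");

    if (r1 != NULL && r2 != NULL && fgets(a, sizeof a, r1) != NULL) {
        /* r1 has consumed line 1; r2 must still be at the start,
         * because each FILE has its own buffer and file position. */
        if (ftell(r2) == 0 && fgets(b, sizeof b, r2) != NULL)
            result = (strcmp(a, b) == 0);
    }
    if (r1 != NULL) fclose(r1);
    if (r2 != NULL) fclose(r2);
    return result;
}
```

    Reading through one stream does not move the other, which is why concurrent read-only access is safe. The same holds when the two streams belong to different processes.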