mpi4py - Slow file read-write in parallel

I am trying to separate a huge (probe) ASCII-format file of ~15 GB into smaller formatted (cloud) files.

NOTE: Just for reference, the probe and cloud files correspond to the two different ways of sampling data in OpenFOAM, as discussed in Section 6.3.1 of the user guide.

For this I use mpi4py so that the lines of the probe file can be processed across several MPI ranks, with the objective of reducing the overall runtime.
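For context, this is roughly how the line indices get split across ranks (a minimal sketch, not the exact code from my script; the values of `nt`, `size`, and `rank` are illustrative, and in the real script `rank` and `size` come from the MPI communicator):

```python
import numpy as np

#- In the real script, rank and size come from mpi4py:
#-   comm = MPI.COMM_WORLD; rank = comm.Get_rank(); size = comm.Get_size()
#- Here they are fixed for illustration.
nt = 1000      # total number of lines to process
size = 64      # number of MPI ranks
rank = 3       # this rank's id (example value)

#- Round-robin split: rank r handles lines r, r+size, r+2*size, ...
timeIndPerRank = np.arange(rank, nt, size)
print(len(timeIndPerRank))  # each rank gets roughly nt/size lines
```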

The part of the code where the conversion takes place looks like this (link to full script below):

#- Read line and transform to column
#- Definitions:
#-    nt        = Total number of lines to process
#-    ntPerRank = Number of lines assigned to the current rank
#-    nHeader   = Number of header lines (to skip)
#-    probeFILE = File containing data to be formatted
#-    cloudFILE = Output file

timeIndFull = np.arange(nt)
ntPerRank = len(timeIndPerRank)

for i in range(ntPerRank):
    #- Get line number in the probe file assigned to the current rank
    lineInRank = timeIndFull[timeIndPerRank[i]]+nHeader
    #- Read the probe file
    with open(probeFILE, 'r') as f:
        for lineNumber, line in enumerate(f):
            if lineNumber == lineInRank:
                #- Split the line and get the relevant elements
                fieldStr = line.split()[1:]
                #- Write to cloud file with each element in a new line
                with open(cloudFILE, 'w') as fout:
                    fout.write('\n'.join(fieldStr))
            #- Stop scanning once past the line assigned to the current rank
            elif lineNumber > lineInRank:
                break
I suspect that this part of the code - i.e. the sequence of file reading, formatting, and file writing - can be improved or altered to reduce the overall runtime. I am not sure how, as I have very little experience with code optimization, so it would be very helpful if you could suggest some points of improvement.

I have shared the full script and a dummy data generator here:

Currently, for a probe file with ~1000 lines to process and each line containing 1,000,000 elements, the processing time on 64 processors is over 3h45m!

NOTE: The calculations are run on a cluster with Intel® Xeon® Gold SKL-6130 @ 2.1 GHz processors.

To reduce the overall runtime, is it just a matter of increasing the number of processors or is it possible to do the same sequence of operations more efficiently? Thanks.


  • Consider using islice from itertools. Also, re-scanning the entire file for every line number assigned to a rank is O(N²), which is definitely contributing to your runtime woes. Store the targeted line numbers for the rank first. Then, open the file and loop over the list of line numbers. For each line n, retrieve the line with for line in islice(f, n - 1, n). This will increase CPU usage a bit but the significant decrease in runtime should make it a worthwhile trade-off. Hope I didn't miss something since I can't run it.

    from itertools import islice

    timeIndFull = np.arange(nt)
    ntPerRank = len(timeIndPerRank)
    #- 1-based line numbers in the probe file assigned to this rank
    linesInRank = [timeIndFull[timeIndPerRank[i]]+nHeader+1
                   for i in range(ntPerRank)]

    def format_line(line):
        #- Split the line and write each element to a new line
        fieldStr = line.split()[1:]
        with open(cloudFILE, 'w') as fout:
            fout.write('\n'.join(fieldStr))

    #- Option [1] - Using islice
    for n in linesInRank:
        with open(probeFILE, 'r') as f:
            for line in islice(f, n-1, n):
                format_line(line)
    Another option can be to store all the lines in a dictionary keyed by line number (note that this reads the entire file into memory, which may be a concern for a ~15G probe file):

    #- Option [2] - Using dictionary
    def get_file_line_dict(lines):
        return {k: v for k, v in enumerate(lines)}

    with open(probeFILE, 'r') as f:
        lines = f.readlines()

    fileLines = get_file_line_dict(lines)
    for n in linesInRank:
        format_line(fileLines[n-1])