
Read in large text file (~20m rows), apply function to rows, write to new text file


I have a very large text file, and a function that does what I want it to do to each line. However, when reading line by line and applying the function, it takes roughly three hours. I'm wondering if there isn't a way to speed this up with chunking or multiprocessing.

My code looks like this:

with open('f.txt', 'r') as f, open('out.txt', 'w') as w:  # 'out.txt' being the new output file
    function(f, w)

Where the function takes in the large text file and the empty output file, processes each line, and writes the results to the output file.
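
For context, a minimal sketch of what such a function might look like; process_line here is a hypothetical stand-in for the actual per-line transformation:

def function(f, w):
    # Stream the input file line by line and write each processed line
    # to the output file; memory use stays flat, but only one core is used.
    for line in f:
        w.write(process_line(line))  # process_line: your per-line logic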

I have tried:

import multiprocessing
from multiprocessing import Pool

def multiprocess(f, w):
    cores = multiprocessing.cpu_count()

    with Pool(cores) as p:
        pieces = p.map(function, f, w)

    f.close()
    w.close()

multiprocess(f, w)

But when I do this, I get a TypeError about an unsupported <= operand between 'io.TextIOWrapper' and 'int'. This could also be the wrong approach, or I may be doing this wrong entirely. Any advice would be much appreciated.


Solution

  • Even if you could successfully pass open file objects to the child OS processes in your Pool as the arguments f and w (which I don't think you can on any OS), having several processes read from and write to the same files concurrently is a bad idea, to say the least. The TypeError itself most likely comes from Pool.map's signature: its third positional parameter is chunksize, so your w file object ends up where an integer is expected, hence the comparison error.

    In general, I recommend using the Process class rather than Pool, assuming that the output needs to maintain the same order as the 20M lines of input.

    https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Process

    The slowest solution, but the most efficient RAM usage

    • Your initial solution: execute the function on the file line by line

    For maximum speed, but the most RAM consumption

    • Read the entire file into RAM as a list via f.readlines(), provided your entire dataset fits comfortably in memory
    • Figure out the number of cores (say 8 cores for example)
    • Split the list evenly into 8 lists
    • Pass each list to the function to be executed by a Process instance (at this point your RAM usage will be roughly doubled again, which is the trade-off for maximum speed); you should del the original big list right after to free some RAM
    • Each Process handles its entire chunk in order, line by line, and writes the results to its own output file (out_file1.txt, out_file2.txt, etc.)
    • Have your OS concatenate the output files, in order, into one big output file. You can use subprocess.run('cat out_file* > big_output.txt', shell=True) on a UNIX system (shell=True is needed for the glob and the redirection), or the equivalent command on Windows. A sketch of this approach follows.
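
    A minimal sketch of this approach, assuming the input file is f.txt and the per-line work lives in a hypothetical process_line() function; the zero-padded output names are a small addition so that the shell glob concatenates the files in numeric order:

import multiprocessing
import subprocess

def process_line(line):
    # Hypothetical per-line transformation; replace with your real logic.
    return line.upper()

def worker(lines, out_path):
    # Each Process handles its chunk in order and writes to its own file.
    with open(out_path, 'w') as out:
        for line in lines:
            out.write(process_line(line))

if __name__ == '__main__':
    cores = multiprocessing.cpu_count()

    with open('f.txt', 'r') as f:
        lines = f.readlines()                     # entire file in RAM

    # Split the list into one roughly equal chunk per core.
    chunk_size = max(1, -(-len(lines) // cores))  # ceiling division
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    del lines                                     # free the original big list

    procs = []
    for i, chunk in enumerate(chunks, start=1):
        p = multiprocessing.Process(target=worker,
                                    args=(chunk, f'out_file{i:02d}.txt'))
        p.start()
        procs.append(p)
    for p in procs:
        p.join()

    # Concatenate the per-process outputs, in order, into one big file (UNIX only).
    subprocess.run('cat out_file* > big_output.txt', shell=True)

    Since each chunk is a contiguous slice of the input and the files are concatenated in numeric order, the final big_output.txt keeps the original line order.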

    For an intermediate trade-off between speed and RAM, but the most complex option, we will have to use the Queue class

    https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Queue

    • Figure out the number of cores and store it in a variable cores (say 8)
    • Initialize 8 Queues and 8 Processes, and pass one Queue to each Process. At this point each Process should open its own output file (out_file1.txt, out_file2.txt, etc.)
    • Each Process blocks on its Queue waiting for a chunk of 10_000 rows, processes it, and writes the results to its own output file sequentially
    • In a loop in the parent process, read 10_000 * 8 lines from your 20M-row input file
    • Split that batch into 8 lists of 10_000 lines each and push one chunk onto each Process's Queue
    • When you are done with the 20M rows, exit the loop and push a special sentinel value into each Process's Queue to signal the end of the input data
    • When a Process sees that end-of-data sentinel in its own Queue, it closes its output file and exits
    • Have your OS concatenate the output files, in order, into one big output file. You can use subprocess.run('cat out_file* > big_output.txt', shell=True) on a UNIX system, or the equivalent command on Windows. A sketch of this approach follows.
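
    A sketch of the Queue-based variant under the same assumptions (process_line is again a hypothetical placeholder, None serves as the end-of-data sentinel, and the bounded queues are an added detail that stops the parent from reading too far ahead of the workers):

import multiprocessing
import subprocess
from itertools import islice

CHUNK = 10_000        # rows per chunk handed to a worker

def process_line(line):
    # Hypothetical per-line transformation; replace with your real logic.
    return line.upper()

def worker(queue, out_path):
    # Each Process opens its own output file and blocks on its Queue.
    with open(out_path, 'w') as out:
        while True:
            chunk = queue.get()        # blocks until a chunk (or sentinel) arrives
            if chunk is None:          # end-of-data sentinel: close file and exit
                break
            for line in chunk:
                out.write(process_line(line))

if __name__ == '__main__':
    cores = multiprocessing.cpu_count()

    # One Queue and one Process per core; maxsize bounds the parent's RAM usage.
    queues = [multiprocessing.Queue(maxsize=4) for _ in range(cores)]
    procs = [multiprocessing.Process(target=worker,
                                     args=(q, f'out_file{i:02d}.txt'))
             for i, q in enumerate(queues, start=1)]
    for p in procs:
        p.start()

    with open('f.txt', 'r') as f:
        while True:
            # Read cores * CHUNK lines and deal them out, one chunk per Queue.
            batch = list(islice(f, CHUNK * cores))
            if not batch:
                break
            for i, q in enumerate(queues):
                chunk = batch[i * CHUNK:(i + 1) * CHUNK]
                if chunk:
                    q.put(chunk)

    for q in queues:
        q.put(None)                    # signal end of input to every worker
    for p in procs:
        p.join()

    # Concatenate the per-process outputs into one big file (UNIX only).
    subprocess.run('cat out_file* > big_output.txt', shell=True)

    One caveat: this follows the steps above literally, so when the parent loops more than once, a plain cat groups each worker's chunks together rather than reproducing the exact input line order; if strict ordering matters, the chunks would need to be numbered and merged back together accordingly.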

    Convoluted? Well, it is usually a trade-off between speed, RAM, and complexity. Also, for a 20M-row task, one needs to make sure that the data processing is as efficient as possible: inline as many functions as you can, avoid a lot of per-line math, use pandas / NumPy in the child processes if possible, etc.