
Read in large text file (~20m rows), apply function to rows, write to new text file


I have a very large text file, and a function that does what I want it to do to each line. However, when reading line by line and applying the function, it takes roughly three hours. I'm wondering if there isn't a way to speed this up with chunking or multiprocessing.

My code looks like this:

with open('f.txt', 'r') as f, open('out.txt', 'w') as w:  # 'out.txt' being the new output file
    function(f, w)

Where the function takes in the large text file and the empty output file, processes each line, and writes the results to the output file.
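
For context, a minimal sketch of what such a function might look like; process_line here is a hypothetical stand-in for the actual per-line transformation:

def function(f, w):
    # Stream the input file line by line and write each processed line
    # to the output file; memory use stays flat, but only one core is used.
    for line in f:
        w.write(process_line(line))  # process_line: your per-line logic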

I have tried:

import multiprocessing
from multiprocessing import Pool

def multiprocess(f, w):
    cores = multiprocessing.cpu_count()

    with Pool(cores) as p:
        pieces = p.map(function, f, w)

    f.close()
    w.close()

multiprocess(f, w)

But when I do this, I get a TypeError about an unsupported <= operand between 'io.TextIOWrapper' and 'int'. This could also be the wrong approach, or I may be doing this wrong entirely. Any advice would be much appreciated.


Solution

  • Even if you could successfully pass open file objects to the child OS processes in your Pool as the arguments f and w (which I don't think you can on any OS), having several processes read from and write to the same files concurrently is a bad idea, to say the least. The TypeError itself most likely comes from Pool.map's signature: its third positional parameter is chunksize, so your w file object ends up where an integer is expected, hence the comparison error.

    In general, I recommend using the Process class rather than Pool, assuming that the output needs to maintain the same order as the 20M lines of input.

    https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Process

    The slowest solution, but the most efficient RAM usage

    • Your initial solution: execute the function on the file line by line

    For maximum speed, but the most RAM consumption

    • Read the entire file into RAM as a list via f.readlines(), provided your entire dataset fits comfortably in memory
    • Figure out the number of cores (say 8 cores for example)
    • Split the list evenly into 8 lists
    • Pass each list to the function to be executed by a Process instance (at this point your RAM usage will be roughly doubled again, which is the trade-off for maximum speed); you should del the original big list right after to free some RAM
    • Each Process handles its entire chunk in order, line by line, and writes the results to its own output file (out_file1.txt, out_file2.txt, etc.)
    • Have your OS concatenate the output files, in order, into one big output file. You can use subprocess.run('cat out_file* > big_output.txt', shell=True) on a UNIX system (shell=True is needed for the glob and the redirection), or the equivalent command on Windows. A sketch of this approach follows.
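
    A minimal sketch of this approach, assuming the input file is f.txt and the per-line work lives in a hypothetical process_line() function; the zero-padded output names are a small addition so that the shell glob concatenates the files in numeric order:

import multiprocessing
import subprocess

def process_line(line):
    # Hypothetical per-line transformation; replace with your real logic.
    return line.upper()

def worker(lines, out_path):
    # Each Process handles its chunk in order and writes to its own file.
    with open(out_path, 'w') as out:
        for line in lines:
            out.write(process_line(line))

if __name__ == '__main__':
    cores = multiprocessing.cpu_count()

    with open('f.txt', 'r') as f:
        lines = f.readlines()                     # entire file in RAM

    # Split the list into one roughly equal chunk per core.
    chunk_size = max(1, -(-len(lines) // cores))  # ceiling division
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    del lines                                     # free the original big list

    procs = []
    for i, chunk in enumerate(chunks, start=1):
        p = multiprocessing.Process(target=worker,
                                    args=(chunk, f'out_file{i:02d}.txt'))
        p.start()
        procs.append(p)
    for p in procs:
        p.join()

    # Concatenate the per-process outputs, in order, into one big file (UNIX only).
    subprocess.run('cat out_file* > big_output.txt', shell=True)

    Since each chunk is a contiguous slice of the input and the files are concatenated in numeric order, the final big_output.txt keeps the original line order.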

    For an intermediate trade-off between speed and RAM, but the most complex option, we will have to use the Queue class

    https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Queue

    • Figure out the number of cores and store it in a variable cores (say 8)
    • Initialize 8 Queues and 8 Processes, and pass one Queue to each Process. At this point each Process should open its own output file (out_file1.txt, out_file2.txt, etc.)
    • Each Process blocks on its Queue waiting for a chunk of 10_000 rows, processes it, and writes the results to its own output file sequentially
    • In a loop in the parent process, read 10_000 * 8 lines from your 20M-row input file
    • Split that batch into 8 lists of 10_000 lines each and push one chunk onto each Process's Queue
    • When you are done with the 20M rows, exit the loop and push a special sentinel value into each Process's Queue to signal the end of the input data
    • When a Process sees that end-of-data sentinel in its own Queue, it closes its output file and exits
    • Have your OS concatenate the output files, in order, into one big output file. You can use subprocess.run('cat out_file* > big_output.txt', shell=True) on a UNIX system, or the equivalent command on Windows. A sketch of this approach follows.
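
    A sketch of the Queue-based variant under the same assumptions (process_line is again a hypothetical placeholder, None serves as the end-of-data sentinel, and the bounded queues are an added detail that stops the parent from reading too far ahead of the workers):

import multiprocessing
import subprocess
from itertools import islice

CHUNK = 10_000        # rows per chunk handed to a worker

def process_line(line):
    # Hypothetical per-line transformation; replace with your real logic.
    return line.upper()

def worker(queue, out_path):
    # Each Process opens its own output file and blocks on its Queue.
    with open(out_path, 'w') as out:
        while True:
            chunk = queue.get()        # blocks until a chunk (or sentinel) arrives
            if chunk is None:          # end-of-data sentinel: close file and exit
                break
            for line in chunk:
                out.write(process_line(line))

if __name__ == '__main__':
    cores = multiprocessing.cpu_count()

    # One Queue and one Process per core; maxsize bounds the parent's RAM usage.
    queues = [multiprocessing.Queue(maxsize=4) for _ in range(cores)]
    procs = [multiprocessing.Process(target=worker,
                                     args=(q, f'out_file{i:02d}.txt'))
             for i, q in enumerate(queues, start=1)]
    for p in procs:
        p.start()

    with open('f.txt', 'r') as f:
        while True:
            # Read cores * CHUNK lines and deal them out, one chunk per Queue.
            batch = list(islice(f, CHUNK * cores))
            if not batch:
                break
            for i, q in enumerate(queues):
                chunk = batch[i * CHUNK:(i + 1) * CHUNK]
                if chunk:
                    q.put(chunk)

    for q in queues:
        q.put(None)                    # signal end of input to every worker
    for p in procs:
        p.join()

    # Concatenate the per-process outputs into one big file (UNIX only).
    subprocess.run('cat out_file* > big_output.txt', shell=True)

    One caveat: this follows the steps above literally, so when the parent loops more than once, a plain cat groups each worker's chunks together rather than reproducing the exact input line order; if strict ordering matters, the chunks would need to be numbered and merged back together accordingly.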

    Convoluted? Well, it is usually a trade-off between speed, RAM, and complexity. Also, for a 20M-row task, one needs to make sure that the data processing is as efficient as possible: inline as many functions as you can, avoid a lot of per-line math, use pandas / NumPy in the child processes if possible, etc.