python, multithreading, parallel-processing, large-files

How to parse a large file taking advantage of threading in Python?


I have a huge file that I need to read and process.

with open(source_filename) as source, open(target_filename) as target:
    for line in source:
        target.write(do_something(line))

    do_something_else()

Can this be accelerated with threads? If I spawn a thread per line, will that incur a huge overhead cost?

edit: To keep this question from turning into a discussion: what should the code look like?

with open(source_filename) as source, open(target_filename) as target:
   ?

@Nicoretti: In each iteration I need to read a line of several KB of data.

update 2: the file may be bz2-compressed, so Python may have to wait for the decompression:

$ bzip2 -dc country.osm.bz2 | ./my_script.py
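
For that pipeline to work, the script has to read the decompressed stream from standard input instead of opening the file itself. A minimal sketch (do_something here is just a stand-in for the real per-line processing):

import sys

def do_something(line):
    return line  # placeholder for the real processing

# bzip2 decompresses in its own process, so unpacking overlaps
# with whatever this loop does per line.
for line in sys.stdin:
    sys.stdout.write(do_something(line))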

Solution

  • You could use three threads: one for reading, one for processing, and one for writing. The possible advantage is that processing can take place while waiting for I/O, but you will need to take some timings yourself to see whether there is an actual benefit in your situation.

    import threading
    import queue  # named "Queue" on Python 2

    QUEUE_SIZE = 1000
    sentinel = object()  # unique end-of-stream marker

    def read_file(name, inqueue):
        # Reader thread: put() blocks once the queue is full, so
        # memory use stays bounded by QUEUE_SIZE.
        with open(name) as f:
            for line in f:
                inqueue.put(line)
        inqueue.put(sentinel)

    def process(inqueue, outqueue):
        # iter(get, sentinel) keeps calling inqueue.get() until the
        # sentinel object comes through.
        for line in iter(inqueue.get, sentinel):
            outqueue.put(do_something(line))
        outqueue.put(sentinel)

    def write_file(name, outqueue):
        with open(name, "w") as f:
            for line in iter(outqueue.get, sentinel):
                f.write(line)

    inq = queue.Queue(maxsize=QUEUE_SIZE)
    outq = queue.Queue(maxsize=QUEUE_SIZE)

    threading.Thread(target=read_file, args=(source_filename, inq)).start()
    threading.Thread(target=process, args=(inq, outq)).start()
    write_file(target_filename, outq)  # the writer runs in the main thread
    

    It is a good idea to set a maxsize for the queues to prevent ever-increasing memory consumption. The value of 1000 is an arbitrary choice on my part.
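
    The answer above uses a single processing thread. If do_something releases the GIL (for example, it is mostly I/O or calls into a C extension), the same sentinel pattern can be stretched to several workers. A sketch, not part of the original answer, reusing read_file, sentinel, QUEUE_SIZE and the imports from above; NUM_WORKERS and the *_many names are purely illustrative:

    NUM_WORKERS = 4  # illustrative; measure before settling on a value

    def process_many(inqueue, outqueue):
        for line in iter(inqueue.get, sentinel):
            outqueue.put(do_something(line))
        inqueue.put(sentinel)   # pass the marker on so sibling workers stop too
        outqueue.put(sentinel)  # one "done" marker per worker

    def write_file_many(name, outqueue, nworkers):
        # Keep writing until every worker has reported completion.
        done = 0
        with open(name, "w") as f:
            while done < nworkers:
                item = outqueue.get()
                if item is sentinel:
                    done += 1
                else:
                    f.write(item)

    inq = queue.Queue(maxsize=QUEUE_SIZE)
    outq = queue.Queue(maxsize=QUEUE_SIZE)

    threading.Thread(target=read_file, args=(source_filename, inq)).start()
    for _ in range(NUM_WORKERS):
        threading.Thread(target=process_many, args=(inq, outq)).start()
    write_file_many(target_filename, outq, NUM_WORKERS)

    Note that with several workers the output lines are no longer guaranteed to be in input order, and for pure-Python CPU-bound work the GIL means the extra threads will not help.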