python, multithreading, parallel-processing, large-files

How to parse a large file taking advantage of threading in Python?


I have a huge file that I need to read and process.

with open(source_filename) as source, open(target_filename) as target:
    for line in source:
        target.write(do_something(line))

    do_something_else()

Can this be accelerated with threads? If I spawn a thread per line, will that incur a huge overhead cost?

edit: To keep this question from turning into a discussion: what should the code look like?

with open(source_filename) as source, open(target_filename) as target:
   ?

@Nicoretti: In each iteration I need to read a line of several KB of data.

update 2: the file may be bz2-compressed, so Python may have to wait for the decompression:

$ bzip2 -dc country.osm.bz2 | ./my_script.py
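
For that pipeline to work, the script has to read the decompressed stream from standard input instead of opening the file itself. A minimal sketch (do_something here is just a stand-in for the real per-line processing):

import sys

def do_something(line):
    return line  # placeholder for the real processing

# bzip2 decompresses in its own process, so unpacking overlaps
# with whatever this loop does per line.
for line in sys.stdin:
    sys.stdout.write(do_something(line))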

Solution

  • You could use three threads: one for reading, one for processing, and one for writing. The possible advantage is that processing can take place while waiting for I/O, but you will need to take some timings yourself to see whether there is an actual benefit in your situation.

    import threading
    import queue  # named "Queue" on Python 2

    QUEUE_SIZE = 1000
    sentinel = object()  # unique end-of-stream marker

    def read_file(name, inqueue):
        # Reader thread: put() blocks once the queue is full, so
        # memory use stays bounded by QUEUE_SIZE.
        with open(name) as f:
            for line in f:
                inqueue.put(line)
        inqueue.put(sentinel)

    def process(inqueue, outqueue):
        # iter(get, sentinel) keeps calling inqueue.get() until the
        # sentinel object comes through.
        for line in iter(inqueue.get, sentinel):
            outqueue.put(do_something(line))
        outqueue.put(sentinel)

    def write_file(name, outqueue):
        with open(name, "w") as f:
            for line in iter(outqueue.get, sentinel):
                f.write(line)

    inq = queue.Queue(maxsize=QUEUE_SIZE)
    outq = queue.Queue(maxsize=QUEUE_SIZE)

    threading.Thread(target=read_file, args=(source_filename, inq)).start()
    threading.Thread(target=process, args=(inq, outq)).start()
    write_file(target_filename, outq)  # the writer runs in the main thread
    

    It is a good idea to set a maxsize for the queues to prevent ever-increasing memory consumption. The value of 1000 is an arbitrary choice on my part.
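
    The answer above uses a single processing thread. If do_something releases the GIL (for example, it is mostly I/O or calls into a C extension), the same sentinel pattern can be stretched to several workers. A sketch, not part of the original answer, reusing read_file, sentinel, QUEUE_SIZE and the imports from above; NUM_WORKERS and the *_many names are purely illustrative:

    NUM_WORKERS = 4  # illustrative; measure before settling on a value

    def process_many(inqueue, outqueue):
        for line in iter(inqueue.get, sentinel):
            outqueue.put(do_something(line))
        inqueue.put(sentinel)   # pass the marker on so sibling workers stop too
        outqueue.put(sentinel)  # one "done" marker per worker

    def write_file_many(name, outqueue, nworkers):
        # Keep writing until every worker has reported completion.
        done = 0
        with open(name, "w") as f:
            while done < nworkers:
                item = outqueue.get()
                if item is sentinel:
                    done += 1
                else:
                    f.write(item)

    inq = queue.Queue(maxsize=QUEUE_SIZE)
    outq = queue.Queue(maxsize=QUEUE_SIZE)

    threading.Thread(target=read_file, args=(source_filename, inq)).start()
    for _ in range(NUM_WORKERS):
        threading.Thread(target=process_many, args=(inq, outq)).start()
    write_file_many(target_filename, outq, NUM_WORKERS)

    Note that with several workers the output lines are no longer guaranteed to be in input order, and for pure-Python CPU-bound work the GIL means the extra threads will not help.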