
Throttling Popen() calls


How much danger is there from starting too many processes with Popen() before the initial Popens have resolved?

I am doing some processing on a directory filled with PDFs. I iterate over each file and do two things using external calls.

First, I get an HTML representation from the Xpdf-based pdftohtml tool (pdfminer is too slow). This outputs only the first page:

from subprocess import check_output
html = check_output(['pdftohtml.exe','-f','1','-l','1','-stdout','-noframes',pdf])

Then, if my conditions are met (I identify that it is the right document), I call tabula-extractor on it to extract a table. Compared to checking the document this is a slow, long-running process, and it only happens on maybe 1 in 20 files.
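The "right document" check above can be sketched as follows. This is only an illustration, assuming the presence of a marker string on the first page identifies the target documents; the MARKER value and function names are hypothetical, not from the question:

```python
from subprocess import check_output

# Hypothetical marker; substitute whatever identifies the target documents.
MARKER = b'Invoice'

def page_matches(html_bytes, marker=MARKER):
    # check_output returns bytes, so match against a bytes marker
    # (or decode first if the marker involves non-ASCII text).
    return marker in html_bytes

def is_target(pdf):
    # Render only the first page to HTML, as in the question.
    html = check_output(['pdftohtml.exe', '-f', '1', '-l', '1',
                         '-stdout', '-noframes', pdf])
    return page_matches(html)
```

Keeping the match logic in a pure function like page_matches makes it easy to test without invoking pdftohtml.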

If I just do call(['jruby', 'C:\\jruby-1.7.4\\bin\\tabula', .....]), I will spend a long time waiting for the extraction to complete when I could be checking more files (I've got 4 cores and 16 GB of RAM, and Tabula doesn't seem to multithread).

So instead, I am using Popen() to avoid blocking.

Popen(['jruby', 'C:\\jruby-1.7.4\\bin\\tabula', '-o', csv, '-f', 'CSV', '-a', "'",topBorder, ',', leftBorder, ',', bottomBorder, ',', rightBorder, "'", '-p', '1', pdf]) 
# where csv is the name of the output file and pdf is the name of the input

I don't care about the return value (tabula is creating a csv file, so I can always see after the fact whether it was created successfully). Doing it this way means that I can keep checking files in the background and starting more tabula processes as needed (again, only about 1 in 20).

This works, but it gets backlogged and ends up running a ton of tabula processes at once. So my questions are: Is this bad? It makes the computer slow for anything else, but as long as it doesn't crash and is working as fast as it can, I don't really mind (all 4 cores sit at 100% the whole time, but memory usage doesn't go above 5.5GB, so it appears CPU-bound).

If it is bad, what is the right way to improve it? Is there a convenient way to say, queue up tabula processes so there are always 1-2 running per core, but I am not trying to process 30 files at once?


Solution

  • Is there a convenient way to say, queue up tabula processes so there are always 1-2 running per core, but I am not trying to process 30 files at once?

    Yes, the multiprocessing module does just that.

    import multiprocessing
    import subprocess
    
    def process_pdf(path):
        # Each worker just blocks on one external tabula process,
        # so the pool size caps how many run at once.
        subprocess.call(['jruby', 'C:\\jruby-1.7.4\\bin\\tabula', path, ...])
    
    pool = multiprocessing.Pool(3)      # at most 3 tabula processes at a time
    results = []
    for path in search_for_files():
        results.append(pool.apply_async(process_pdf, [path]))
    for result in results:
        result.wait()                   # block until every file is done
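Since each worker spends its time waiting on an external process rather than doing Python work, a thread pool would serve equally well here and avoids the pickling overhead of multiprocessing. A minimal sketch using the standard library's concurrent.futures (run_all and the worker callable are illustrative names, not part of the question's code):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_all(paths, worker, max_workers=4):
    """Call worker(path) for every path, with at most
    max_workers calls in flight at any one time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, p): p for p in paths}
        # Collect results as each job finishes.
        return {futures[f]: f.result() for f in as_completed(futures)}
```

Here worker would wrap the subprocess.call to tabula, and max_workers caps the backlog so only a handful of jruby processes ever run at once.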