How much danger is there from starting too many processes with Popen() before the initial Popens have resolved?
I am doing some processing on a directory filled with PDFs. I iterate over each file and do two things using external calls.
First, I get an HTML representation from the Xpdf-based pdftohtml tool (pdfminer is too slow). This outputs only the first page:
html = check_output(['pdftohtml.exe','-f','1','-l','1','-stdout','-noframes',pdf])
Then, if my conditions are met (i.e. I identify that it is the right document), I call tabula-extractor on it to extract a table. This is a long-running process compared to checking the document, and it only happens on maybe 1 in 20 files.
If I just do
call(['jruby', 'C:\\jruby-1.7.4\\bin\\tabula', .....])
I will spend a long time waiting for the extraction to complete when I could be checking more files (I've got 4 cores and 16 GB of RAM, and Tabula doesn't seem to multithread).
So instead, I am using Popen() to avoid blocking.
Popen(['jruby', 'C:\\jruby-1.7.4\\bin\\tabula', '-o', csv, '-f', 'CSV', '-a', "'", topBorder, ',', leftBorder, ',', bottomBorder, ',', rightBorder, "'", '-p', '1', pdf])
# where csv is the name of the output file and pdf is the name of the input
I don't care about the return value (tabula is creating a csv file, so I can always see after the fact whether it was created successfully). Doing it this way means that I can keep checking files while tabula runs in the background, starting more tabula processes as needed (again, only about 1 in 20).
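As an aside: I believe tabula-extractor's -a option takes the area as one comma-separated string rather than as separate quote and comma arguments, so if the quoting above ever gives trouble it may be safer to build that argument up front. A minimal sketch, with hypothetical border values standing in for the ones computed earlier:

```python
from subprocess import Popen

# hypothetical coordinates standing in for the values computed earlier
topBorder, leftBorder, bottomBorder, rightBorder = '100', '50', '400', '550'

# join the area into a single 'top,left,bottom,right' argument
area = ','.join([topBorder, leftBorder, bottomBorder, rightBorder])

csv = 'out.csv'  # hypothetical output path
pdf = 'in.pdf'   # hypothetical input path
args = ['jruby', 'C:\\jruby-1.7.4\\bin\\tabula',
        '-o', csv, '-f', 'CSV', '-a', area, '-p', '1', pdf]
# Popen(args)  # launched as before, without blocking
```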
This works, but it gets backlogged and ends up running a ton of tabula processes at once. So my questions are: Is this bad? It makes the computer slow for anything else, but as long as it doesn't crash and is working as fast as it can, I don't really mind (all 4 cores sit at 100% the whole time, but memory usage doesn't go above 5.5 GB, so it appears CPU-bound).
If it is bad, what is the right way to improve it? Is there a convenient way to say, queue up tabula processes so there are always 1-2 running per core, but I am not trying to process 30 files at once?
Yes, the multiprocessing module does just that.
import multiprocessing
import subprocess

def process_pdf(path):
    # each worker just launches tabula and waits for it to finish
    subprocess.call(['jruby', 'C:\\jruby-1.7.4\\bin\\tabula', path, ...])

if __name__ == '__main__':  # required on Windows when using multiprocessing
    pool = multiprocessing.Pool(3)  # at most 3 tabula processes at a time
    results = []
    for path in search_for_files():
        results.append(pool.apply_async(process_pdf, [path]))
    for result in results:
        result.wait()
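Since each worker here does nothing but wait on an external process, a pool of threads works just as well as a pool of processes (and needs no Windows-specific __main__ guard). A minimal sketch using multiprocessing.dummy, which offers the same Pool API backed by threads; a trivial Python subprocess stands in for the real tabula command line, and the file list is hypothetical:

```python
from multiprocessing.dummy import Pool  # same API as multiprocessing.Pool, backed by threads
import subprocess
import sys

def process_pdf(path):
    # stand-in for the real jruby/tabula command line; returns the exit code
    return subprocess.call([sys.executable, '-c', 'pass'])

pool = Pool(3)  # at most 3 subprocesses in flight at once
results = [pool.apply_async(process_pdf, [p])
           for p in ['a.pdf', 'b.pdf', 'c.pdf', 'd.pdf']]
pool.close()
exit_codes = [r.get() for r in results]  # blocks until every call finishes
pool.join()
print(exit_codes)  # prints [0, 0, 0, 0]
```

With this variant, apply_async still queues the work so only three tabula invocations run at once, while the main thread remains free to keep scanning files.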