I've a function that takes in a list of images and produces the output, in a list, after applying OCR to the image. I have an another function that controls the input to this function, by using multiprocessing. So, when I have a single list (i.e. no multiprocessing), each image of the list took ~ 1s, but when I increased the lists that had to be processed parallely to 4, each image took an astounding 13s.
To understand where the problem really is, I tried to create a minimal working example of the problem. Here I have two functions eat25
and eat100
which open an image name
and feed it to the OCR, that uses the API pytesseract
. eat25
does it 25 times, and eat100
does it 100 times.
My aim here is to run eat100
without multiprocessing, and eat25
with multiprocessing (with 4 processes). This, theoretically, should take 4 times less time that eat100
if I have 4 separate processors (I have 2 cores with 2 threads per core, thus CPU(s) = 4 (correct me if I'm wrong here)).
But all theory laid wasted when I saw that the code didn't even respond after printing "Processing 0" 4 times. The single processor function eat100
worked fine though.
I had tested a simple range cubing function, and it did work well with multiprocessing, so my processors do work well for sure. The only culprits here could be:
pytesseract
: See this`
from pathos.multiprocessing import ProcessingPool
from time import time
from PIL import Image
import pytesseract as pt
def eat25(name):
for i in range(25):
print('Processing :'+str(i))
pt.image_to_string(Image.open(name),lang='hin+eng',config='--psm 6')
def eat100(name):
for i in range(100):
print('Processing :'+str(i))
pt.image_to_string(Image.open(name),lang='hin+eng',config='--psm 6')
st = time()
eat100('normalBox.tiff')
en = time()
print('Direct :'+str(en-st))
#Using pathos
def caller():
pool = ProcessingPool()
pool.map(eat25,['normalBox.tiff','normalBox.tiff','normalBox.tiff','normalBox.tiff'])
if (__name__=='__main__'):
caller()
en2 = time()
print('Pathos :'+str(en2-en))
So, where the problem really is? Any help is appreciated!
EDIT:
The image normalBox.tiff
can be found here. I would be glad if people reproduce the code and check if the problem continues.
I'm thepathos
author. If your code takes 1s
to run serially, then it's quite possible that it will take longer to run in naive process parallel. There is overhead to working with naive process parallel:
I'd suggest checking a few simple things to check where your issues might be:
pathos.pools.ThreadPool
to use thread parallel instead of process parallel. This can reduce some of the overhead for serialization and spinning up the pool.pathos.pools._ProcessPool
to change how pathos
manages the pool. Without the underscore, pathos
keeps the pool around as a singleton, and requires a 'terminate' to explicitly kill the pool. With the underscore, the pool dies when you delete the pool object. Note that your caller
function does not close
or join
(or terminate
) the pool.dill.dumps
one of the elements you are trying to process in parallel. Things like big numpy
arrays can take a while to serialize. If the size of what is being passed around is large, you might consider using a shared memory array (i.e. a multiprocess.Array
or the equivalent version for numpy
arrays -- also see: numpy.ctypeslib
) to minimize what is being passed between each process.The latter is a bit more work, but can provide huge savings if you have a lot to serialize. There is no shared memory pool, so you have to do a for loop over the individual multiprocess.Process
objects if you need to go that route.