
Why does comparing two images take longer when running the procedure in parallel using Python's multiprocessing Pool?

I'm developing a program that computes similarity scores for around 480 pairs of images (20 directories with around 24 images in each). I'm using the sentence_transformers Python module for image comparison. Comparing two images takes around 0.1 to 0.2 seconds on my Windows 11 machine when running serially, but that time increases to between 1.5 and 3.0 seconds when running in parallel using a process Pool. So either (a) there's something going on behind the scenes that I'm not yet aware of, or (b) I just did it wrong.

Here's a rough structure of the image comparison function:

from time import time
from sentence_transformers import util

def compare_images(image_one, image_two, clip_model):
    start = time()
    images = [image_one, image_two]
    # clip_model is set to SentenceTransformer('clip-ViT-B-32') elsewhere in the code
    encoded_images = clip_model.encode(images, batch_size=2, convert_to_tensor=True, show_progress_bar=False)
    # Each entry is [score, index_one, index_two]; with two images there is only one pair
    processed_images = util.paraphrase_mining_embeddings(encoded_images)
    stop = time()
    print("Comparison time: %f" % (stop - start))
    score, image_id1, image_id2 = processed_images[0]
    return score

Here's a rough structure of the serial version of the code to compare every image:

def compare_all_images(candidate_image, directory, clip_model):
    for dir_entry in os.scandir(directory):
        dir_image_path = dir_entry.path
        dir_image = Image.open(dir_image_path)
        similarity_score = compare_images(candidate_image, dir_image, clip_model)

        # ... code to determine whether this is the maximum score the program has seen...

Here is a rough structure of the parallel version:

def compare_all_images(candidate_image, directory, clip_model):
    pool_results = dict()
    pool = Pool()

    for dir_entry in os.scandir(directory):
        dir_image_path = dir_entry.path
        dir_image = Image.open(dir_image_path)
        pool_results[dir_image_path] = pool.apply_async(compare_images, args=(candidate_image, dir_image, clip_model))

    # Added everything to the pool, close it and wait for everything to finish
    pool.close()
    pool.join()

    # ... remaining code to determine which image has the highest similarity rating

I'm not sure where I might be erring.

The interesting thing here is that I also developed a smaller program to verify whether I was doing things correctly:

def func():
    sleep(6)

def main():
    pool = Pool()
    for i in range(20):
        pool.apply_async(func)
    pool.close()

    start = time()
    pool.join()
    stop = time()
    print("Time: %f" % (stop - start) ) # This gave an average of 12 seconds
                                        # across multiple runs on my Windows 11
                                        # machine, on which multiprocessing.cpu_count() == 12
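
For reference, 12 seconds is roughly what a correctly working pool should give here: 20 six-second tasks on 12 workers run in two waves. A back-of-the-envelope sketch, assuming Pool() creates one worker per reported CPU and ignoring process startup overhead (expected_wall_time is a hypothetical helper, not part of multiprocessing):

```python
import math


def expected_wall_time(num_tasks, task_seconds, num_workers):
    """Ideal wall time when equal-length tasks run in waves of num_workers."""
    waves = math.ceil(num_tasks / num_workers)
    return waves * task_seconds


# 20 tasks of 6 s each on 12 workers: two waves of 6 s each
print(expected_wall_time(20, 6, 12))  # 12
```

So the sleep test suggests the pool itself is scheduling work as expected.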

Is this a problem with trying to make things parallel with sentence transformers, or does the problem lie elsewhere?

UPDATE: Now I'm especially confused. I'm now only passing str objects to the comparison function and have temporarily slapped a return 0 as the very first line in the function to see if I can further isolate the issue. Oddly, even though the parallel function is doing absolutely nothing now, several seconds (usually around 5) still seem to pass between the time that the pool is closed and the time that pool.join() finishes. Any thoughts?

UPDATE 2: I've done some more playing around, and have found out that an empty pool still has some overhead. This is the code I'm testing out currently:

            # ...
            pool = Pool()

            pool.close()
            start = time()
            DebuggingUtilities.debug("empty pool closed, doing a join on the empty pool to see if directory traversal is messing things up")
            pool.join()
            stop = time()

            DebuggingUtilities.debug("Empty pool join time: %f" % (stop - start) )

This gives me an "Empty pool join time" of about 5 seconds. Moving this snippet to the very first part of my main function still yields the same. Perhaps Pool works differently on Windows? In WSL (Ubuntu 20.04), the same code runs in about 0.02 seconds. So, what would cause even an empty Pool to hang for such a long time on Windows?

UPDATE 3: I've made another discovery. The empty pool problem goes away if the only imports I have are from multiprocessing import Pool and from time import time. However, the program uses a boatload of import statements across several source files, which causes the program to hang a bit when it first starts. I suspect that this is propagating down into the Pool for some reason. Unfortunately, I need all of the import statements that are in the source files, so I'm not sure how to get around this (or why the imports would affect an empty Pool).

UPDATE 4: So, apparently it's the from sentence_transformers import SentenceTransformer line that's causing issues (without that import, the pool.join() call happens relatively quickly). I think the easiest solution now is to simply move the compare_images function into a separate file. I'll update this question again as I implement this.
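
A quick way to confirm which import is responsible is to time it directly in a fresh interpreter. This is only a minimal sketch: time_import is a hypothetical helper, and "json" is a stand-in so the script runs anywhere; the suspect package name would go in its place.

```python
import importlib
import time


def time_import(module_name):
    """Return the seconds spent importing module_name in this process."""
    start = time.perf_counter()
    importlib.import_module(module_name)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Replace "json" with the module under suspicion.
    for name in ["json"]:
        print("%s: %.3f s" % (name, time_import(name)))
```

Only the first import of a module in a given process is slow; later imports are served from the sys.modules cache, so each module should be timed in a new interpreter.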

UPDATE 5: I've done a little more playing around, and it seems like on Windows, the import statements get executed multiple times whenever a Pool gets created, which I think is just weird. Here's the code I used to verify this:

from multiprocessing import Pool
from datetime import datetime
from time import time
from utils import test

print("outside function lol")

def get_time():

    now = datetime.now()

    return "%02d/%02d/%04d - %02d:%02d:%02d" % (now.month, now.day, now.year, now.hour, now.minute, now.second)


def main():
    pool = Pool()

    print("Starting pool")

    """
    for i in range(4):
        print("applying %d to pool %s" % (i, get_time() ) )
        pool.apply_async(test, args = (i, ) )
    """

    pool.close()
    print("Pool closed, waiting for all processes to finish")
    start = time()
    pool.join()

    stop = time()

    print("pool done: %f" % (stop - start) )

if __name__ == "__main__":

    main()

Running through Windows command prompt:

outside function lol
Starting pool
Pool closed, waiting for all processes to finish
outside function lol
outside function lol
outside function lol
outside function lol
outside function lol
outside function lol
outside function lol
outside function lol
outside function lol
outside function lol
outside function lol
outside function lol
pool done: 4.794051

Running through WSL:

outside function lol
Starting pool
Pool closed, waiting for all processes to finish
pool done: 0.048856

UPDATE 6: I think I might have a workaround, which is to create the Pool in a file that doesn't directly or indirectly import anything from sentence_transformers. I then pass the model and anything else I need from sentence_transformers as parameters to a function that handles the Pool and kicks off all of the parallel processes. Since the sentence_transformers import seems to be the only problematic one, I'll wrap that import statement in an if __name__ == "__main__" block so it only runs once, in the parent process, which will be fine, as I'm passing the things I need from it as parameters. It's a rather janky solution, and probably not what others would consider "Pythonic", but I have a feeling this will work.

UPDATE 7: The workaround was successful. I've managed to get the pool join time on an empty pool down to something reasonable (0.2 - 0.4 seconds). The downside of this approach is that there is definitely considerable overhead in passing the entire model as a parameter to the parallel function, which I needed to do as a result of creating the Pool in a different place than the model was being imported. I'm quite close, though.

Solution

I've done a little more digging, and think I've finally discovered the root of the problem, and it has everything to do with what's described here.

To summarize, on Linux systems, processes are forked from the main process, meaning that the current process state is copied (which is why the import statements don't run multiple times). On Windows (and, by default, macOS), processes are spawned, meaning that a fresh interpreter starts for each worker and runs the "main" file again from the top, thus executing all of its import statements again. With 12 workers, that means 12 repeats of the slow sentence_transformers import, which matches the 12 extra "outside function lol" lines in the Windows output above. So, the behavior I'm seeing is not a bug, but I will need to rethink my program design to account for this.
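
One design that accounts for this, sketched under assumptions: keep the top level of the main file cheap and defer the slow import into a function that only the parent calls. Here get_heavy_module and square are hypothetical names, and decimal is merely a stand-in for sentence_transformers.

```python
import multiprocessing as mp


def get_heavy_module():
    """Import the slow dependency on first use, not at module import time."""
    import decimal  # stand-in for a slow third-party import
    return decimal


def square(x):
    # Worker task that does not need the heavy dependency at all.
    return x * x


def main():
    print("start method:", mp.get_start_method())
    heavy = get_heavy_module()  # paid once, in the parent only
    print("heavy module loaded:", heavy.__name__)
    with mp.Pool(2) as pool:
        print(pool.map(square, range(4)))


if __name__ == "__main__":
    main()
```

Under spawn, the workers re-run only the cheap top-level code (the multiprocessing import and the function definitions), so pool startup no longer depends on the heavy import. If the workers themselves need the model, the deferred import would have to happen inside each worker instead.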