
Function not returning in multiprocess, no errors


I am running multiple worker processes with multiprocessing.Pool:

import spacy
import multiprocessing
import logging

# global variable
nlp_bert = spacy.load("en_trf_bertbaseuncased_lg")
logging.basicConfig(level=logging.DEBUG)


def job_pool(data, job_number, job_to_do, groupby=None, split_col=None, **kwargs):
    pool = multiprocessing.Pool(processes=job_number)
    jobs = pool.map(job_to_do, data)
    return jobs


def job(slice):
    logging.debug('this shows')
    w1 = nlp_bert('word')
    w2 = nlp_bert('other')
    logging.debug(w1.similarity(w2))
    logging.debug("this doesn't")


job_pool([1, 2, 3, 4], 4, job)

The call to nlp_bert never completes inside the worker processes (the first debug message is logged, the second never is), nothing is returned, and no error is raised. How can I find out what is going wrong? I already have logging set to the DEBUG level.

The same code works fine outside of multiprocessing, e.g. running the following directly in a script:

import spacy
nlp_bert = spacy.load("en_trf_bertbaseuncased_lg")
w1 = nlp_bert('word')
w2 = nlp_bert('other')
print(w1.similarity(w2))

0.8381155446247196

I'm using:

  • Python 3.8.2
  • spaCy 2.3.2

Solution

  • It turns out this is a known issue with PyTorch running multithreading in child processes, which causes deadlocks:

    https://github.com/explosion/spaCy/issues/4667

    A workaround is to add the following at the top of the script, before the model is used:

    import torch
    
    torch.set_num_threads(1)
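Since the thread-count setting has to take effect in every process that touches the model, one way to make that explicit is a Pool initializer, which runs once in each worker as it starts. The sketch below uses only the standard library (job and job_pool here are simplified stand-ins for the spaCy code above, not the real thing), just to show where the call would go:

```python
import multiprocessing


def init_worker():
    # Placeholder for the real per-process setup; with spaCy + PyTorch this
    # is where torch.set_num_threads(1) would go, before the model is used.
    pass


def job(item):
    # Stands in for the nlp_bert similarity computation.
    return item * 2


def job_pool(data, job_number, job_to_do):
    # initializer runs once in each worker process right after it starts.
    with multiprocessing.Pool(processes=job_number,
                              initializer=init_worker) as pool:
        return pool.map(job_to_do, data)


if __name__ == "__main__":
    print(job_pool([1, 2, 3, 4], 4, job))  # [2, 4, 6, 8]
```

With the real model, init_worker would call torch.set_num_threads(1) (and, if you want to avoid relying on fork semantics, load the spaCy model there as well instead of at module level).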