Why is the parallel version of my code slower than the serial one?

I am trying to run a model many times, which is time-consuming. To speed it up I tried to parallelize it, but the parallel version turned out slower: it takes 40 seconds, while the serial version takes 34 seconds.

# !pip install --target=$nb_path transformers
from transformers import pipeline

oracle = pipeline(model="deepset/roberta-base-squad2")
question = 'When did the first extension of the Athens Tram take place?'
print("Data size is: ", len(data))

parallel = True

if parallel == False:
  counter = 0
  l = len(data)
  cr = []
  for words in data:
    print(counter, " out of ", l)
    cr.append(oracle(question=question, context=words))
    counter += 1  # advance the progress counter
elif parallel == True:
  from multiprocessing import Process, Queue
  import multiprocessing

  no_CPU = multiprocessing.cpu_count()
  print("Number of cpu : ", no_CPU)
  l = len(data)

  def answer_question(data, no_CPU, sub_no):
    cr_process = []
    counter_process = 0
    l_data = len(data)  # size of this chunk; computed once, outside the loop
    for words in data:
      # print("n is", no_CPU)
      # print("l is", l_data)
      print(counter_process, " out of ", l_data, "in subprocess number", sub_no)
      cr_process.append(oracle(question=question, context=words))
      counter_process += 1  # advance the progress counter
      # Q.put(cr_process)

  n = no_CPU      # number of subprocesses
  m = l//n        # number of samples each of the first n-1 subprocesses handles
  res = l % n     # number of extra samples the last subprocess handles
  # print(m)
  # print(res)
  procs = []
  # instantiating process with arguments
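  for x in range(n-1):
    # print(x*m)
    # print((x+1)*m)
    proc = Process(target=answer_question, args=(data[x*m:(x+1)*m],n, x+1,))
    procs.append(proc)
    proc.start()
  # the last subprocess also takes the res leftover samples
  proc = Process(target=answer_question, args=(data[(n-1)*m:n*m+res],n,n,))
  procs.append(proc)
  proc.start()

  # complete the processes
  for proc in procs:
    proc.join()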

A sample of the data variable can be found here (to avoid flooding the question). The parallel flag switches between the serial and the parallel version. So my question is: why does this happen, and how do I make the parallel version faster? I am using Google Colab, which has 2 CPU cores available; at least, that is what multiprocessing.cpu_count() reports.


  • Your pipeline is already running on multiple CPU cores even when it runs as a single process. The transformers code is optimized to use multiple cores. When you create multiple processes on top of that, you lose time starting the processes and switching between them.

    To verify this, look at your CPU utilization while running the so-called "single process" version. You should see all cores already at maximum, so creating extra parallel processes is not going to save you any time.
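
If you would rather check this from inside the notebook than from a system monitor, one rough approach is to compare the CPU time the process consumed with the wall-clock time that elapsed. This is only a sketch: `measure_cpu_usage` and `busy_sum` are hypothetical names introduced for illustration, and `busy_sum` is a stand-in workload that you would replace with your serial loop over `oracle(...)`. It assumes the model's worker threads live inside the Python process, so that `time.process_time()` accounts for them.

```python
import time


def measure_cpu_usage(fn, *args, **kwargs):
    """Run fn once and return (result, wall_seconds, cpu_seconds).

    time.process_time() sums the CPU time of all threads in this process,
    so cpu_seconds / wall_seconds roughly estimates how many cores were busy.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = fn(*args, **kwargs)
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    return result, wall_seconds, cpu_seconds


def busy_sum(n):
    # Hypothetical stand-in workload; swap in the serial loop over oracle(...).
    total = 0
    for i in range(n):
        total += i * i
    return total


if __name__ == "__main__":
    result, wall, cpu = measure_cpu_usage(busy_sum, 2_000_000)
    print(f"wall: {wall:.2f}s, cpu: {cpu:.2f}s, cores busy ~ {cpu / wall:.2f}")
```

A ratio close to the number of available cores (2 on the Colab instance from the question) would suggest that the "serial" run already saturates the machine. A ratio close to 1 would suggest that it does not, in which case splitting the work across processes might still help.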