Tags: python, python-2.7, parallel-processing, joblib

Accessing and altering a global array using python joblib


I am attempting to use joblib in Python to speed up some data processing, but I'm having trouble working out how to assign the output into the required format. I have put together a perhaps overly simplistic example that shows the issues I'm encountering:

from joblib import Parallel, delayed
import numpy as np

def main():
    print "Nested loop array assignment:"
    regular()
    print "Parallel nested loop assignment using a single process:"
    par2(1)
    print "Parallel nested loop assignment using multiple process:"
    par2(2)

def regular():
    # Define variables
    a = [0,1,2,3,4]
    b = [0,1,2,3,4]
    # Set array variable to global and define size and shape
    global ab
    ab = np.zeros((2,np.size(a),np.size(b)))

    # Iterate to populate array
    for i in range(0,np.size(a)):
        for j in range(0,np.size(b)):
            func(i,j,a,b)

    # Show array output
    print ab

def par2(process):
    # Define variables
    a2 = [0,1,2,3,4]
    b2 = [0,1,2,3,4]
    # Set array variable to global and define size and shape
    global ab2
    ab2 = np.zeros((2,np.size(a2),np.size(b2)))

    # Parallel process in order to populate array
    Parallel(n_jobs=process)(delayed(func2)(i,j,a2,b2) for i in xrange(0,np.size(a2)) for j in xrange(0,np.size(b2)))

    # Show array output
    print ab2

def func(i,j,a,b):
    # Populate array
    ab[0,i,j] = a[i]+b[j]
    ab[1,i,j] = a[i]*b[j]

def func2(i,j,a2,b2):
    # Populate array
    ab2[0,i,j] = a2[i]+b2[j]
    ab2[1,i,j] = a2[i]*b2[j]

# Run script
main()

The output of which looks like this:

Nested loop array assignment:
[[[  0.   1.   2.   3.   4.]
  [  1.   2.   3.   4.   5.]
  [  2.   3.   4.   5.   6.]
  [  3.   4.   5.   6.   7.]
  [  4.   5.   6.   7.   8.]]

 [[  0.   0.   0.   0.   0.]
  [  0.   1.   2.   3.   4.]
  [  0.   2.   4.   6.   8.]
  [  0.   3.   6.   9.  12.]
  [  0.   4.   8.  12.  16.]]]
Parallel nested loop assignment using a single process:
[[[  0.   1.   2.   3.   4.]
  [  1.   2.   3.   4.   5.]
  [  2.   3.   4.   5.   6.]
  [  3.   4.   5.   6.   7.]
  [  4.   5.   6.   7.   8.]]

 [[  0.   0.   0.   0.   0.]
  [  0.   1.   2.   3.   4.]
  [  0.   2.   4.   6.   8.]
  [  0.   3.   6.   9.  12.]
  [  0.   4.   8.  12.  16.]]]
Parallel nested loop assignment using multiple process:
[[[ 0.  0.  0.  0.  0.]
  [ 0.  0.  0.  0.  0.]
  [ 0.  0.  0.  0.  0.]
  [ 0.  0.  0.  0.  0.]
  [ 0.  0.  0.  0.  0.]]

 [[ 0.  0.  0.  0.  0.]
  [ 0.  0.  0.  0.  0.]
  [ 0.  0.  0.  0.  0.]
  [ 0.  0.  0.  0.  0.]
  [ 0.  0.  0.  0.  0.]]]

From Google and the StackOverflow search function, it appears that when using joblib the global array isn't shared between the subprocesses. I'm uncertain whether this is a limitation of joblib or whether there is a way to get around it.
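
One pattern I have seen suggested that avoids shared state altogether is to have each call return its results and let the parent process assemble the array afterwards. Here is a minimal sketch of that idea (the helper name compute is only for illustration and isn't part of my real code):

from joblib import Parallel, delayed
import numpy as np

a = [0, 1, 2, 3, 4]
b = [0, 1, 2, 3, 4]

def compute(i, j, a, b):
    # Return the two values for position (i, j) instead of writing to a global
    return a[i] + b[j], a[i] * b[j]

# Each call returns a tuple; the parent process reassembles them into the array
results = Parallel(n_jobs=2)(delayed(compute)(i, j, a, b)
                             for i in range(len(a)) for j in range(len(b)))
ab = np.array(results).T.reshape(2, len(a), len(b))
print(ab)

The drawback is that all the per-(i, j) results are first collected into a Python list before the array is built, which may matter when x runs into the thousands.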

In reality my script is surrounded by other code that relies on the final output of this global array being in a (4,x,x) format, where x is variable (but typically ranges from the hundreds to several thousands). This is my current reason for looking at parallel processing, as the whole process can take up to 2 hours for x = 2400.

The use of joblib isn't necessary (but I like the nomenclature and simplicity), so feel free to suggest simple alternative methods, ideally keeping in mind the requirements of the final array. I'm using Python 2.7.3 and joblib 0.7.1.
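
For comparison, the kind of joblib-free alternative I had in mind would be the standard library's multiprocessing.Pool, again returning results and reassembling them rather than writing into a shared global. A minimal sketch (the helper name compute is again only illustrative):

import multiprocessing as mp
from itertools import product
import numpy as np

a = [0, 1, 2, 3, 4]
b = [0, 1, 2, 3, 4]

def compute(ij):
    # Pool.map passes a single argument, so unpack the (i, j) pair here
    i, j = ij
    return a[i] + b[j], a[i] * b[j]

if __name__ == '__main__':
    pool = mp.Pool(processes=2)
    results = pool.map(compute, product(range(len(a)), range(len(b))))
    pool.close()
    pool.join()
    ab = np.array(results).T.reshape(2, len(a), len(b))
    print(ab)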


Solution

  • I was able to resolve the issue in this simple example using numpy's memmap. I was still having problems after using memmap and following the examples on the joblib documentation webpage, but once I upgraded to the latest joblib version (0.9.3) via pip it all ran smoothly. Here is the working code:

    from joblib import Parallel, delayed
    import numpy as np
    import os
    import tempfile
    import shutil
    
    def main():
    
        print "Nested loop array assignment:"
        regular()
    
        print "Parallel nested loop assignment using numpy's memmap:"
        par3(4)
    
    def regular():
        # Define variables
        a = [0,1,2,3,4]
        b = [0,1,2,3,4]
    
        # Set array variable to global and define size and shape
        global ab
        ab = np.zeros((2,np.size(a),np.size(b)))
    
        # Iterate to populate array
        for i in range(0,np.size(a)):
            for j in range(0,np.size(b)):
                func(i,j,a,b)
    
        # Show array output
        print ab
    
    def par3(process):
    
        # Create a temporary directory and define the array path
        path = tempfile.mkdtemp()
        ab3path = os.path.join(path,'ab3.mmap')
    
        # Define variables
        a3 = [0,1,2,3,4]
        b3 = [0,1,2,3,4]
    
        # Create the array using numpy's memmap
        ab3 = np.memmap(ab3path, dtype=float, shape=(2,np.size(a3),np.size(b3)), mode='w+')
    
        # Parallel process in order to populate array
        Parallel(n_jobs=process)(delayed(func3)(i,a3,b3,ab3) for i in xrange(0,np.size(a3)))
    
        # Show array output
        print ab3
    
        # Delete the temporary directory and contents
        try:
            shutil.rmtree(path)
        except:
            print "Couldn't delete folder: "+str(path)
    
    def func(i,j,a,b):
        # Populate array
        ab[0,i,j] = a[i]+b[j]
        ab[1,i,j] = a[i]*b[j]
    
    def func3(i,a3,b3,ab3):
        # Populate array
        for j in range(0,np.size(b3)):
            ab3[0,i,j] = a3[i]+b3[j]
            ab3[1,i,j] = a3[i]*b3[j]
    
    # Run script
    main()
    

    Giving the following results:

    Nested loop array assignment:
    [[[  0.   1.   2.   3.   4.]
      [  1.   2.   3.   4.   5.]
      [  2.   3.   4.   5.   6.]
      [  3.   4.   5.   6.   7.]
      [  4.   5.   6.   7.   8.]]
    
     [[  0.   0.   0.   0.   0.]
      [  0.   1.   2.   3.   4.]
      [  0.   2.   4.   6.   8.]
      [  0.   3.   6.   9.  12.]
      [  0.   4.   8.  12.  16.]]]
    Parallel nested loop assignment using numpy's memmap:
    [[[  0.   1.   2.   3.   4.]
      [  1.   2.   3.   4.   5.]
      [  2.   3.   4.   5.   6.]
      [  3.   4.   5.   6.   7.]
      [  4.   5.   6.   7.   8.]]
    
     [[  0.   0.   0.   0.   0.]
      [  0.   1.   2.   3.   4.]
      [  0.   2.   4.   6.   8.]
      [  0.   3.   6.   9.  12.]
      [  0.   4.   8.  12.  16.]]]
    

    A few of my thoughts to note for any future readers:

    • On small arrays, the time taken to prepare the parallel environment (generally referred to as overhead) means that this runs slower than the simple for loop.
    • For a larger array, e.g. setting a and a3 to np.arange(0,10000) and b and b3 to np.arange(0,1000), I got times of 12.4 s for the "regular" method and 7.7 s for the joblib method.
    • The overhead also meant that it was faster to let each worker perform the whole inner j loop (see func3). This makes sense, since it dispatches only 10,000 tasks rather than 10,000,000 tasks, each of which would need setting up (see the sketch after this list for taking the idea one step further).
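
    To illustrate that last point, here is a minimal sketch of taking the chunking one step further: dispatching one task per block of i values instead of one task per i (the names fill_block and par_blocks, and the chunk size of 2, are only for illustration, not from my real code):

    from joblib import Parallel, delayed
    import numpy as np
    import os
    import tempfile
    import shutil

    def fill_block(i_start, i_end, a, b, ab):
        # Each task fills a whole block of rows, so far fewer tasks are dispatched
        for i in range(i_start, i_end):
            for j in range(len(b)):
                ab[0, i, j] = a[i] + b[j]
                ab[1, i, j] = a[i] * b[j]

    def par_blocks(process, chunk=2):
        # Create a temporary directory to hold the memory-mapped array
        path = tempfile.mkdtemp()
        try:
            a = [0, 1, 2, 3, 4]
            b = [0, 1, 2, 3, 4]
            ab = np.memmap(os.path.join(path, 'ab.mmap'), dtype=float,
                           shape=(2, len(a), len(b)), mode='w+')
            # One delayed call per chunk of i values rather than one per i or per (i, j)
            Parallel(n_jobs=process)(delayed(fill_block)(s, min(s + chunk, len(a)), a, b, ab)
                                     for s in range(0, len(a), chunk))
            print(ab)
        finally:
            shutil.rmtree(path)

    par_blocks(2)

    On larger arrays, the chunk size becomes a tuning knob: bigger chunks mean less dispatch overhead, smaller chunks mean better load balancing across the workers.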