
How to add elements to a NumPy array in a Python function


I am trying to create a series of NumPy arrays from a text file using a pool of workers from the multiprocessing module.

import numpy as np
import multiprocessing as mp

def process_line(line, x,y,z,t):
    sl = line.split()
    x = np.append(x,float(sl[0].replace(',','')))
    y = np.append(y,float(sl[1].replace(',','')))
    z = np.append(z,float(sl[2].replace(',','')))
    t = np.append(t,float(sl[3].replace(',','')))

def txt_to_HDF_converter(name, path_file):

    #init objects
    x = np.empty(0)
    y = np.empty(0)
    z = np.empty(0)
    t = np.empty(0)
    pool = mp.Pool(4)
    jobs = []

    with open(path_file) as f:
        for line in f:
            jobs.append(pool.apply_async(process_line,(line,x,y,z,t)))

    #wait for all jobs to finish
    for job in jobs:
        job.get()
    #clean up
    pool.close()

The problem comes when the arrays are assigned in the process_line function: as if the arguments were passed by value, at the end of the loop I end up with arrays containing only one element. Any idea how to get around this?


Solution

  • You are passing the values as part of a tuple in the code here:

            jobs.append(pool.apply_async(process_line,(line,x,y,z,t)))
    

    Then you unpack this tuple implicitly in the function:

    def process_line(line, x,y,z,t):
    

    Then you do not change the existing values but instead create new ones with these lines:

        x = np.append(x,float(sl[0].replace(',','')))
        y = np.append(y,float(sl[1].replace(',','')))
        z = np.append(z,float(sl[2].replace(',','')))
        t = np.append(t,float(sl[3].replace(',','')))
    

    Let me repeat this: you do not change the original arrays (as you appear to expect). Instead, you use the old values to create new arrays, which you then bind to the local names x, y, z, and t. When you leave the function, those new values are discarded. This can never have any effect outside the function (not even for the last value).
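    A minimal sketch of that rebinding behavior (the function name here is illustrative): assigning to a parameter inside a function never affects the caller's variable, because np.append returns a brand-new array rather than modifying its input.

    ```python
    import numpy as np

    def append_inside(arr):
        # np.append returns a brand-new array; this line only
        # rebinds the local name 'arr' to it. The caller's array
        # is untouched, and the new array is discarded on return.
        arr = np.append(arr, 1.0)

    x = np.empty(0)
    append_inside(x)
    # x is still empty after the call
    ```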

    You have several options for getting around this.

    1. Use global variables. This is a quick fix but bad style and in the long run you will hate me for this advice. But if you just need it to work quickly, then this might be your option.

    2. Return your values. After creating the new values, return them somehow and make sure that the next call gets the previously returned values again as input. This is the functional approach.

    3. Pass your values by reference. You can do this by passing a one-element list instead of x; see the code below. Passing references like this is typical C-style programming and not very Pythonic (but it works). Many IDEs will warn you about it, and the typical Python developer will have a hard time understanding what you are doing there. A nicer variant is to put your data into some kind of object, which will then be passed by reference, instead of using plain lists.

    x_ref = [x]
    y_ref = [y]
    z_ref = [z]
    t_ref = [t]
    
    with open(path_file) as f:
        for line in f:
            jobs.append(pool.apply_async(process_line,(line,x_ref,y_ref,z_ref,t_ref)))
    

    Then process_line needs to be adjusted to expect references as well:

    def process_line(line, x_ref,y_ref,z_ref,t_ref):
        sl = line.split()
        x_ref[0] = np.append(x_ref[0],float(sl[0].replace(',','')))
        y_ref[0] = np.append(y_ref[0],float(sl[1].replace(',','')))
        z_ref[0] = np.append(z_ref[0],float(sl[2].replace(',','')))
        t_ref[0] = np.append(t_ref[0],float(sl[3].replace(',','')))
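    For completeness, here is a sketch of option 2 (the functional approach) adapted to the original code; the function and variable names are illustrative. Each worker returns the parsed values instead of mutating shared state, and the arrays are built once from the collected results:

    ```python
    import numpy as np
    import multiprocessing as mp

    def process_line(line):
        # Parse one line and return the four values
        # instead of mutating shared arrays.
        sl = line.split()
        return tuple(float(v.replace(',', '')) for v in sl[:4])

    def txt_to_arrays(path_file):
        # pool.map collects every worker's return value, in order.
        with open(path_file) as f, mp.Pool(4) as pool:
            results = pool.map(process_line, f)
        # Build each array once from the corresponding column of results.
        x, y, z, t = (np.array(col) for col in zip(*results))
        return x, y, z, t
    ```

    This also sidesteps a second problem: building each array once at the end avoids the repeated reallocation that calling np.append per line would cause.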