thread-safety, scipy, multiprocessing, sparse-matrix, python-multiprocessing

Safety of sharing a read-only scipy sparse matrix between multiple processes


I have a somewhat expensive computation to do, and I want to spawn multiple processes to complete it. The gist is more or less this:

1) I have a big scipy.sparse.csc_matrix (I could use another sparse format if needed) from which I'm going to read (only read, never write) data for the calculation.

2) I must do lots of embarrassingly parallel calculations and return their values.

So I did something like this:

import numpy as np
from multiprocessing import Process, Manager

def f(instance, big_matrix):
    """
    This is the actual thing I want to calculate. This reads lots of 
    data from big_matrix but never writes anything to it.
    """
    return stuff_calculated

def do_some_work(big_matrix, instances, outputs):
    """
    This does chunked calculations for a few instances and saves the
    results in `outputs`, a memory-shared dictionary.
    """
    for instance in instances:
        x = f(instance, big_matrix)
        outputs[instance] = x

def split_work(big_matrix, instances_to_calculate):
    """
    Split do_some_work into many processes by chunking instances_to_calculate, 
    creating a shared dictionary and spawning and joining the processes.
    """

    # break the instance list into 4 chunks, one per process
    instance_sets = np.array_split(instances_to_calculate, 4) 

    manager = Manager()
    outputs = manager.dict()

    processes = [
        Process(target=do_some_work, args=(big_matrix, instances, outputs))
        for instances in instance_sets
    ]
    for p in processes:
        p.start()
    for p in processes:
        p.join()

    return instance_sets, outputs

My question is: is this safe? My function f never writes anything, but I'm not taking any precautions when sharing big_matrix between processes, just passing it as it is. It seems to work, but I'm concerned that I could corrupt something just by passing a value between multiple processes, even though I never write to it.

I tried to use the sharedmem package to share the matrix between multiple processes, but it seems unable to hold scipy sparse matrices, only plain numpy arrays.

If this isn't safe, how can I share (read-only) big sparse matrices between processes without problems?

I've seen here that I can make another csc_matrix pointing to the same memory with:

other_matrix = csc_matrix(
    (big_matrix.data, big_matrix.indices, big_matrix.indptr),
    shape=big_matrix.shape,
    copy=False
)

Will this make it safer, or would it be just as safe as passing the original object?

Thanks.


Solution

  • As explained here, it seems your first option creates one copy of the sparse matrix per process. This is safe, but it isn't ideal from a performance point of view. However, depending on the computation you perform on the sparse matrix, the overhead may not be significant. (The first sketch below illustrates these copy semantics.)

    I suspect a cleaner option using the multiprocessing lib would be to create three arrays (depending on the matrix format you use) and populate them with the values, row indices and column pointers (the data, indices and indptr attributes) of your CSC matrix. The documentation for multiprocessing shows how this can be done using an Array or using the Manager and one of the supported types; the second sketch below shows one way to wire this up.

    Afterwards I don't see how you could run into trouble with read-only operations, and it may be more efficient.
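
A minimal sketch of the copy semantics from the first point: under the spawn start method the matrix passed through args is pickled into the child, so even a write in the child never reaches the parent (under fork, copy-on-write gives the same isolation). The random test matrix and the mutate helper are illustrative, not from the question.

import multiprocessing as mp
import scipy.sparse as sp

def mutate(m):
    # This write only touches the child's own copy of the matrix,
    # which is why passing it directly between processes is safe.
    m.data[:] = 0.0

if __name__ == "__main__":
    big_matrix = sp.random(100, 100, density=0.1, format="csc")
    total = big_matrix.sum()

    ctx = mp.get_context("spawn")  # spawn always pickles the args
    p = ctx.Process(target=mutate, args=(big_matrix,))
    p.start()
    p.join()

    print(big_matrix.sum() == total)  # True: the parent's matrix is untouched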
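
And a sketch of the shared-array approach suggested above, assuming float64 data and int32 index arrays (check big_matrix.indices.dtype first, since scipy switches to int64 for very large matrices). Each worker wraps the shared buffers with np.frombuffer and rebuilds a csc_matrix view over them with copy=False, so the matrix data exists only once in memory. The names to_shared, worker and the toy f are illustrative, not part of any library.

import numpy as np
import scipy.sparse as sp
from multiprocessing import Array, Manager, Process

def f(instance, big_matrix):
    # Stand-in for the real read-only computation.
    return big_matrix[:, instance].sum()

def to_shared(arr, ctype):
    """Copy a numpy array into a lock-free shared ctypes array."""
    shared = Array(ctype, arr.size, lock=False)
    np.frombuffer(shared, dtype=arr.dtype)[:] = arr
    return shared

def worker(shape, data, indices, indptr, instances, outputs):
    # Rebuild a CSC view over the shared buffers; copy=False means
    # no per-process copy of the matrix data is made here.
    big_matrix = sp.csc_matrix(
        (np.frombuffer(data, dtype=np.float64),
         np.frombuffer(indices, dtype=np.int32),
         np.frombuffer(indptr, dtype=np.int32)),
        shape=shape, copy=False)
    for instance in instances:
        outputs[instance] = f(instance, big_matrix)

if __name__ == "__main__":
    big_matrix = sp.random(1000, 1000, density=0.01, format="csc")
    # 'd' = float64 for the values, 'i' = int32 for the index arrays.
    data = to_shared(big_matrix.data, "d")
    indices = to_shared(big_matrix.indices, "i")
    indptr = to_shared(big_matrix.indptr, "i")

    manager = Manager()
    outputs = manager.dict()
    instance_sets = np.array_split(np.arange(big_matrix.shape[1]), 4)

    processes = [
        Process(target=worker,
                args=(big_matrix.shape, data, indices, indptr, instances, outputs))
        for instances in instance_sets
    ]
    for p in processes:
        p.start()
    for p in processes:
        p.join()

    print(len(outputs))  # one result per column/instance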