python numpy parallel-processing vectorization joblib

Efficiently filling NumPy array using lists of indices


I know how to execute a parallel loop in joblib that returns a list as its result.

However, is it possible to fill a predefined numpy matrix in parallel?

Imagine the following minimal example matrix and data:

import numpy as np

column_data = ['a', 'b', 'c', 'd', 'e', 'f', 'x']
data = [['a', 'b', 'c'],
        ['d', 'c'],
        ['e', 'f', 'd', 'x']]
x = np.zeros((len(data), len(column_data)))

Note that column_data is sorted and unique. data is a list of lists, not a rectangular matrix.

The loop:

for row, labels in enumerate(data):
    for label in labels:
        x[row, column_data.index(label)] = 1

Is it possible to parallelize this loop? Filling a 70,000 x 10,000 matrix is quite slow without parallelization.


Solution

  • Here's an almost vectorized approach (the only Python-level loop left is the list comprehension that computes the row lengths) -

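    import numpy as np

    # Number of labels in each row of the ragged list
    lens = [len(item) for item in data]
    # Prepend column_data so every label maps to its column position
    A = np.concatenate((column_data, np.concatenate(data)))
    _, idx = np.unique(A, return_inverse=True)

    # Row and column index of every entry to set
    R = np.repeat(np.arange(len(lens)), lens)
    C = idx[len(column_data):]

    out = np.zeros((len(data), len(column_data)))
    out[R, C] = 1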
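
To make the index bookkeeping concrete, here is a sketch that runs the approach above on the sample data from the question and shows the intermediate arrays in comments. It relies on the question's statement that column_data is sorted and unique, and it additionally assumes that every label in data appears in column_data.

```python
import numpy as np

column_data = ['a', 'b', 'c', 'd', 'e', 'f', 'x']
data = [['a', 'b', 'c'],
        ['d', 'c'],
        ['e', 'f', 'd', 'x']]

lens = [len(item) for item in data]            # [3, 2, 4]
# Row index of every entry in the flattened data.
R = np.repeat(np.arange(len(lens)), lens)      # [0 0 0 1 1 2 2 2 2]

# Stack column_data in front of the flattened data so that np.unique
# assigns every label its rank in the sorted set of labels.
A = np.concatenate((column_data, np.concatenate(data)))
_, idx = np.unique(A, return_inverse=True)
# Drop the leading column_data part; what remains are column indices.
C = idx[len(column_data):]                     # [0 1 2 3 2 4 5 3 6]

out = np.zeros((len(data), len(column_data)))
out[R, C] = 1                                  # one assignment per (row, column) pair
```

Each pair (R[k], C[k]) names one cell to set, so the final assignment replaces both Python loops with a single indexed write.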
    

    Here's another, using np.searchsorted -

    lens = [len(item) for item in data]
    R = np.repeat(np.arange(len(lens)), lens)
    # column_data is sorted, so a binary search gives each label's column
    C = np.searchsorted(column_data, np.concatenate(data))

    out = np.zeros((len(data), len(column_data)))
    out[R, C] = 1
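
One caveat with the searchsorted version: np.searchsorted returns an insertion point even when a label is not present in column_data, so a stray label would silently set the wrong cell or index out of bounds. Below is a minimal sketch that wraps the approach in a function with an exact-match check and compares it against the original loop. The name fill_indicator and the error message are hypothetical, not part of the answer above, and the sketch assumes column_data is sorted and unique as the question states.

```python
import numpy as np

def fill_indicator(data, column_data):
    """Hypothetical helper: build the 0/1 matrix from ragged label lists.

    Assumes column_data is sorted and unique, as stated in the question.
    """
    columns = np.asarray(column_data)
    lens = [len(item) for item in data]
    flat = np.concatenate(data)
    R = np.repeat(np.arange(len(lens)), lens)
    C = np.searchsorted(columns, flat)

    # searchsorted gives an insertion point even for absent labels,
    # so verify that every position found holds exactly that label.
    C_safe = np.minimum(C, len(columns) - 1)
    if not np.array_equal(columns[C_safe], flat):
        raise ValueError("data contains labels missing from column_data")

    out = np.zeros((len(data), len(columns)))
    out[R, C] = 1
    return out

column_data = ['a', 'b', 'c', 'd', 'e', 'f', 'x']
data = [['a', 'b', 'c'],
        ['d', 'c'],
        ['e', 'f', 'd', 'x']]

# Reference result from the nested loop in the question.
expected = np.zeros((len(data), len(column_data)))
for row, labels in enumerate(data):
    for label in labels:
        expected[row, column_data.index(label)] = 1

result = fill_indicator(data, column_data)
assert np.array_equal(result, expected)
```

The check costs one extra fancy-indexing pass over the flattened labels, which should be small next to allocating the output matrix.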