python, performance, numpy, parallel-processing, database-performance

Idea to speed up array processing


I want to create a dataset B by processing a dataset A. To do this, every column in A (~2 million) has to be processed in batches (passed through a neural network), resulting in 3 outputs which are stacked together and then stored, e.g. in a NumPy array.

My code looks like the following, which does not seem to be the best solution.

# Load data
data = get_data()

# Storage for B
B = np.empty(shape=data.shape)

# Process data
for idx, data_B in enumerate(data):
    # Process data
    a, b, c = model(data_B)

    # Reshape and feed in B
    B[idx * batch_size:batch_size * (idx + 1)] = np.squeeze(np.concatenate((a, b, c), axis=1))

I am looking for ideas to speed up the stacking or assigning process. I do not know whether parallel processing is possible, since everything ultimately has to be stored in the same array (the ordering is not important). Is there any Python framework I can use?

Loading the data takes 29 s (only done once), stacking and assigning takes 20 s for a batch size of only 2. The model call takes < 1 s, allocating the array takes 5 s, and all other parts take < 1 s.
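
A rough way to confirm this breakdown is to time the model call and the stack-and-assign step separately; the sketch below is not runnable on its own, since get_data(), model() and batch_size are the same user-defined names as in the code above:

import time
import numpy as np

# Timing sketch -- separates the model call from the stack-and-assign step
data = get_data()
B = np.empty(shape=data.shape)

t_model = 0.0
t_assign = 0.0
for idx, data_B in enumerate(data):
    t0 = time.perf_counter()
    a, b, c = model(data_B)
    t_model += time.perf_counter() - t0

    t0 = time.perf_counter()
    B[idx * batch_size:batch_size * (idx + 1)] = np.squeeze(np.concatenate((a, b, c), axis=1))
    t_assign += time.perf_counter() - t0

print("model:", t_model, "stack+assign:", t_assign)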


Solution

  • Your arrays' shapes, and especially the number of dimensions, are unclear. I can make a few guesses from what works in the code. Your times suggest that things are very large, so memory management may be a big issue. Creating large temporary arrays takes time.

    What is data.shape? Probably 2d at least; B has the same shape

    B = np.empty(shape=data.shape)
    

    Now you iterate on the 1st dimension of data; let's call them rows, though they might be 2d or larger:

    # Process data
    for idx, data_B in enumerate(data):
        # Process data
        a, b, c = model(data_B)
    

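    As a sanity check on what enumerate(data) yields, here is a tiny made-up example (sizes are invented for illustration):

    import numpy as np

    # Iterating an ndarray walks its first axis,
    # so a (4, 2, 3) array yields four (2, 3) "rows"
    data = np.arange(24).reshape(4, 2, 3)
    for idx, data_B in enumerate(data):
        print(idx, data_B.shape)    # 0 (2, 3), 1 (2, 3), 2 (2, 3), 3 (2, 3)
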
    What is the nature of a, b, and c? I'm assuming arrays, with a shape similar to data_B, but that's just a guess.

        # Reshape and feed in B
        B[idx * batch_size:batch_size * (idx + 1)] = \
            np.squeeze(np.concatenate((a, b, c), axis=1))
    

    For concatenate to work, a, b, c must be 2d (at least). Let's guess they are all (n, m). The result is (n, 3m). Why the squeeze? Is the shape (1, 3m)?
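
    To make the shape bookkeeping concrete, here is what happens if a, b, c are each (1, m); these sizes are guesses, not from the question:

    import numpy as np

    m = 4
    a = b = c = np.ones((1, m))                  # guessed shapes
    stacked = np.concatenate((a, b, c), axis=1)  # (1, 3*m)
    print(stacked.shape)                         # (1, 12)
    print(np.squeeze(stacked).shape)             # (12,) -- squeeze drops the length-1 axis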

    I don't know batch_size, but with anything other than 1 I don't think this works. B[idx:idx+1, :] = ... works, since idx ranges over B.shape[0], but with other values it would produce an error.

    With this batch_size slice indexing it almost looks like you are trying to string out the iteration results in a long 1d array, batch_size values per iteration. But that doesn't fit with B matching data in shape.
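
    A toy example of the slice arithmetic shows the problem; the sizes here are invented just to illustrate it:

    import numpy as np

    n, m, batch_size = 4, 5, 2
    B = np.empty((n, m))                       # same shape as a (4, 5) data
    for idx in range(n):                       # enumerate(data) gives idx = 0..3
        sl = slice(idx * batch_size, batch_size * (idx + 1))
        print(idx, B[sl].shape)                # (2, 5), (2, 5), (0, 5), (0, 5)

    # For idx >= 2 the slice is past the end of B, so assigning a (2, 5)
    # result there raises "could not broadcast input array from shape (2,5)
    # into shape (0,5)"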

    That puzzle aside, I wonder if you really need the concatenate. Can you initialize B so you can assign values directly, e.g.

    B[idx, 0, ...] = a
    B[idx, 1, ...] = b
    # etc.
    

    Reshaping an array after filling it is trivial. Even transposing axes isn't too time consuming.
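
    A minimal sketch of that idea, with dummy stand-ins for get_data() and model() and guessed shapes (each output assumed to be (batch_size, m)):

    import numpy as np

    batch_size, m, n_batches = 2, 8, 5        # invented sizes

    def get_data():
        # stand-in for the real loader
        return np.random.rand(n_batches, batch_size, m)

    def model(x):
        # stand-in for the network: three outputs, each (batch_size, m)
        return x, 2.0 * x, 3.0 * x

    data = get_data()

    # Extra axis of length 3, filled in place -- no per-iteration concatenate
    B = np.empty((n_batches, 3, batch_size, m), dtype=data.dtype)

    for idx, data_B in enumerate(data):
        a, b, c = model(data_B)
        B[idx, 0] = a
        B[idx, 1] = b
        B[idx, 2] = c

    # A single reshape/transpose at the end is cheap
    B_flat = B.reshape(-1, m)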