Tags: python, numpy, optimization, numpy-ndarray

Manipulating 2D matrix using numpy boolean indexing takes a long time


I've generated a huge amount of random data like so:

ndata = np.random.binomial(1, 0.25, (100000, 1000))

This is a 100,000 by 1,000 matrix(!)

From it I'm generating a new boolean matrix where, for each row, the entry in column k is True if the mean of that row's columns up to and including k, minus the expectation of a Bernoulli RV with p=0.25, is greater than or equal to some epsilon in absolute value.

Like so:

def true_false_inequality(data, eps, data_len):
    return [abs(np.mean(data[:index + 1]) - 0.25) >= eps for index in range(data_len)]

After doing so I'm generating a 1-d array (finally!) where each entry is the number of True values in the corresponding column of that matrix, and then I divide every entry by the number of rows (exp_number = 100,000).

def final_res(data, eps):
    tf = np.array([true_false_inequality(data[seq], eps, data_len) for seq in range(exp_number)])
    percentage = tf.sum(axis=0)/exp_number
    return percentage

I also have 5 different epsilons, which I iterate over to compute this final result 5 times (epsilons = [0.001, 0.1, 0.5, 0.25, 0.025]).

My code works, but it takes a long time for 100,000 rows by 1,000 columns. I know I could make it faster by making better use of NumPy's functionality, but I just don't know how.


Solution

  • You can perform the whole calculation with vectorized operations on the full data array:

    mean = np.cumsum(data, axis=1) / np.arange(1, data.shape[1]+1)
    condition = np.abs(mean - 0.25) >= eps
    percentage = condition.sum(axis=0) / len(data)
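
Since the question loops over five epsilons, one possible way to package the vectorized steps above is to compute the running mean once and reuse it for every epsilon. This is only a minimal sketch: the function name `deviation_fractions`, the parameter `p`, and the small test array are hypothetical names chosen for illustration and do not appear in the original post.

```python
import numpy as np

def deviation_fractions(data, epsilons, p=0.25):
    """For each eps, return a 1-d array giving, per column, the fraction of
    rows whose running mean deviates from p by at least eps."""
    n_cols = data.shape[1]
    # Running mean of each row after 1, 2, ..., n_cols samples.
    running_mean = np.cumsum(data, axis=1) / np.arange(1, n_cols + 1)
    # The absolute deviation does not depend on eps, so compute it once.
    deviation = np.abs(running_mean - p)
    # The mean of a boolean array over axis 0 is the fraction of True values.
    return {eps: (deviation >= eps).mean(axis=0) for eps in epsilons}

# Small illustrative input (assumed sizes, much smaller than the question's).
rng = np.random.default_rng(0)
small = rng.binomial(1, 0.25, (200, 50))
epsilons = [0.001, 0.1, 0.5, 0.25, 0.025]
results = deviation_fractions(small, epsilons)
```

For the full 100,000 by 1,000 input, the cumulative sum and the running mean are each full-size arrays (roughly 800 MB apiece with 8-byte elements), so if memory is tight the same function could be applied to blocks of rows and the per-column counts accumulated instead.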