
Using a boolean Mask on large numpy array is very slow


I have a performance issue when coding with Python. Let's say I have a very large (N×2) array of strings, with N = 10,000,000 (as in the code below), a matching array of N costs, and two string variables label_a and label_b. Here is the code:

import numpy as np
import time

indices = np.array([np.random.choice(np.arange(5000).astype(str), size=10000000),
                    np.random.choice(np.arange(5000).astype(str), size=10000000)]).T
costs = np.random.uniform(size=10000000)

label_a = '2'
label_b = '9'

t0 = time.time()    

costs = costs[(indices[:,0]!=label_a)*(indices[:,0]!=label_b)*
              (indices[:,1]!=label_a)*(indices[:,1]!=label_b)]
indices = indices[(indices[:,0]!=label_a)*(indices[:,0]!=label_b)*
                  (indices[:,1]!=label_a)*(indices[:,1]!=label_b)]

t1 = time.time()
toseq = t1-t0
print(toseq)

The above code segment takes about 3 seconds every time it's run. I would like to achieve the same result at a lower computing cost: I am using a boolean mask to keep only the rows of the costs and indices arrays whose values are neither label_a nor label_b.


Solution

  • As indicated in the comments, computing the comparisons you're after only once, and combining them into a single mask only once, saves time.

    (I've also changed the way of timing, just for brevity - the results are the same)

    import numpy as np
    from timeit import timeit
    
    r = 5000
    n = 10000000
    
    indices = np.array([
        np.random.choice(np.arange(r).astype(str), size=n),
        np.random.choice(np.arange(r).astype(str), size=n)
    ]).T
    costs = np.random.uniform(size=n)
    
    label_a = '2'
    label_b = '9'
    
    n_indices = np.array([
        np.random.choice(np.arange(r), size=n),
        np.random.choice(np.arange(r), size=n)
    ]).T
    
    
    def run():
        global indices
        global costs
    
        _ = costs[(indices[:, 0] != label_a)*(indices[:, 0] != label_b) *
                  (indices[:, 1] != label_a)*(indices[:, 1] != label_b)]
        _ = indices[(indices[:, 0] != label_a)*(indices[:, 0] != label_b) *
                    (indices[:, 1] != label_a)*(indices[:, 1] != label_b)]
    
    
    def run_faster():
        global indices
        global costs
    
        # compute these comparisons only once
        not_a0 = indices[:, 0] != label_a
        not_b0 = indices[:, 0] != label_b
        not_a1 = indices[:, 1] != label_a
        not_b1 = indices[:, 1] != label_b
        _ = costs[not_a0 * not_b0 * not_a1 * not_b1]
        _ = indices[not_a0 * not_b0 * not_a1 * not_b1]
    
    
    def run_even_faster():
        global indices
        global costs
    
        # also combine them only once
        cond = ((indices[:, 0] != label_a) * (indices[:, 0] != label_b) *
                (indices[:, 1] != label_a) * (indices[:, 1] != label_b))
        _ = costs[cond]
        _ = indices[cond]
    
    
    def run_sep_mask():
        global indices
        global costs
        global cond
    
        # just the masking part of run_even_faster
        cond = ((indices[:, 0] != label_a) * (indices[:, 0] != label_b) *
                (indices[:, 1] != label_a) * (indices[:, 1] != label_b))
    
    
    def run_sep_index():
        global indices
        global costs
        global cond
    
        # just the indexing part of run_even_faster
        _ = costs[cond]
        _ = indices[cond]
    
    
    def run_even_faster_numerical():
        global indices
        global costs
    
        # use int values and n_indices instead of indices
        a = int(label_a)
        b = int(label_b)
    
        cond = ((n_indices[:, 0] != a) * (n_indices[:, 0] != b) *
                (n_indices[:, 1] != a) * (n_indices[:, 1] != b))
        _ = costs[cond]
        _ = indices[cond]
    
    
    def run_all(funcs):
        for f in funcs:
            print('{:.4f} : {}()'.format(timeit(f, number=1), f.__name__))
    
    
    run_all([run, run_faster, run_even_faster, run_sep_mask, run_sep_index, run_even_faster_numerical])
    

    Note that I also added an example where the comparison is based on numbers rather than strings. If you can avoid string values and work with numbers instead, you get a performance boost as well.

    This boost gets substantial if you start comparing longer labels - in the end it might even be worth converting the strings to numbers before the filtering, if the strings get long enough.
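    For example, the one-time conversion could look like the following minimal sketch. It uses a tiny array for illustration (the same approach applies at N = 10,000,000) and assumes every entry is a valid numeric string, as in the question's data:

```python
import numpy as np

# Small illustrative arrays; the same approach applies at N = 10,000,000.
indices = np.array([['2', '7'], ['3', '9'], ['4', '5'], ['9', '1']])
costs = np.array([0.1, 0.2, 0.3, 0.4])

label_a = '2'
label_b = '9'

# One-time conversion: numeric strings -> integers.
num_indices = indices.astype(int)
a, b = int(label_a), int(label_b)

# Integer comparisons are cheaper than string comparisons.
cond = ((num_indices[:, 0] != a) & (num_indices[:, 0] != b) &
        (num_indices[:, 1] != a) & (num_indices[:, 1] != b))

print(costs[cond])    # only the cost of the ['4', '5'] row survives
print(indices[cond])
```

    The conversion itself costs time, so it only pays off when the filter runs repeatedly or the labels are long enough.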

    These are my results:

    0.9711 : run()
    0.7065 : run_faster()
    0.6983 : run_even_faster()
    0.2657 : run_sep_mask()
    0.4174 : run_sep_index()
    0.4536 : run_even_faster_numerical()
    

    The two sep entries show that, for run_even_faster, the indexing (about 0.42 s) takes roughly 1.5 times as long as building the mask (about 0.27 s), so you can only expect so much improvement from tuning the mask further.

    However, they also show that building the mask from integers adds less than 0.04 seconds on top of the actual indexing, compared to about 0.26 seconds for building the mask from strings. So, that's the room you have for improvement.
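    As a further side note (not benchmarked above), NumPy's np.isin can express the same four comparisons more compactly, and & is the idiomatic boolean operator where the code above uses *. A sketch, again with a tiny array for illustration:

```python
import numpy as np

indices = np.array([['2', '7'], ['3', '9'], ['4', '5'], ['9', '1']])
costs = np.array([0.1, 0.2, 0.3, 0.4])

labels = ['2', '9']

# True for rows where neither column matches any of the labels.
cond = ~np.isin(indices, labels).any(axis=1)

print(costs[cond])     # only the cost of the ['4', '5'] row survives
print(indices[cond])
```

    Whether this is faster than the explicit comparisons depends on the number of labels; for just two labels the explicit version is usually competitive, but np.isin scales more gracefully as the label set grows.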