
Using a boolean Mask on large numpy array is very slow


I have a performance issue when coding with Python. Let's say I have a very large (N×2) array of strings, with N = 10,000,000 (as in the code below), a matching array of N costs, and two string variables label_a and label_b. Here is the code:

import numpy as np
import time

indices = np.array([np.random.choice(np.arange(5000).astype(str), size=10000000),
                    np.random.choice(np.arange(5000).astype(str), size=10000000)]).T
costs = np.random.uniform(size=10000000)

label_a = '2'
label_b = '9'

t0 = time.time()    

costs = costs[(indices[:,0]!=label_a)*(indices[:,0]!=label_b)*
              (indices[:,1]!=label_a)*(indices[:,1]!=label_b)]
indices = indices[(indices[:,0]!=label_a)*(indices[:,0]!=label_b)*
                  (indices[:,1]!=label_a)*(indices[:,1]!=label_b)]

t1 = time.time()
toseq = t1-t0
print(toseq)

The above code segment takes about 3 seconds every time it's run. I would like to achieve the same result at a lower computing cost: I am using a boolean mask to keep only the rows of the costs and indices arrays whose values are neither label_a nor label_b.


Solution

  • As indicated in the comments, computing the comparisons you're after only once, and combining them into a single mask only once, saves time.

    (I've also changed the way of timing, just for brevity - the results are the same)

    import numpy as np
    from timeit import timeit
    
    r = 5000
    n = 10000000
    
    indices = np.array([
        np.random.choice(np.arange(r).astype(str), size=n),
        np.random.choice(np.arange(r).astype(str), size=n)
    ]).T
    costs = np.random.uniform(size=n)
    
    label_a = '2'
    label_b = '9'
    
    n_indices = np.array([
        np.random.choice(np.arange(r), size=n),
        np.random.choice(np.arange(r), size=n)
    ]).T
    
    
    def run():
        global indices
        global costs
    
        _ = costs[(indices[:, 0] != label_a)*(indices[:, 0] != label_b) *
                  (indices[:, 1] != label_a)*(indices[:, 1] != label_b)]
        _ = indices[(indices[:, 0] != label_a)*(indices[:, 0] != label_b) *
                    (indices[:, 1] != label_a)*(indices[:, 1] != label_b)]
    
    
    def run_faster():
        global indices
        global costs
    
        # compute these comparisons only once
        not_a0 = indices[:, 0] != label_a
        not_b0 = indices[:, 0] != label_b
        not_a1 = indices[:, 1] != label_a
        not_b1 = indices[:, 1] != label_b
        _ = costs[not_a0 * not_b0 * not_a1 * not_b1]
        _ = indices[not_a0 * not_b0 * not_a1 * not_b1]
    
    
    def run_even_faster():
        global indices
        global costs
    
        # also combine them only once
        cond = ((indices[:, 0] != label_a) * (indices[:, 0] != label_b) *
                (indices[:, 1] != label_a) * (indices[:, 1] != label_b))
        _ = costs[cond]
        _ = indices[cond]
    
    
    def run_sep_mask():
        global indices
        global costs
        global cond
    
        # just the masking part of run_even_faster
        cond = ((indices[:, 0] != label_a) * (indices[:, 0] != label_b) *
                (indices[:, 1] != label_a) * (indices[:, 1] != label_b))
    
    
    def run_sep_index():
        global indices
        global costs
        global cond
    
        # just the indexing part of run_even_faster
        _ = costs[cond]
        _ = indices[cond]
    
    
    def run_even_faster_numerical():
        global indices
        global costs
    
        # use int values and n_indices instead of indices
        a = int(label_a)
        b = int(label_b)
    
        cond = ((n_indices[:, 0] != a) * (n_indices[:, 0] != b) *
                (n_indices[:, 1] != a) * (n_indices[:, 1] != b))
        _ = costs[cond]
        _ = indices[cond]
    
    
    def run_all(funcs):
        for f in funcs:
            print('{:.4f} : {}()'.format(timeit(f, number=1), f.__name__))
    
    
    run_all([run, run_faster, run_even_faster, run_sep_mask, run_sep_index, run_even_faster_numerical])
    

    Note that I also added an example where the comparison is based on numbers rather than strings. If you can avoid string values and work with numbers instead, you get a performance boost as well.

    This boost gets substantial if you start comparing longer labels - in the end it might even be worth converting the strings to numbers before the filtering, if the strings get long enough.
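    For example, the one-time conversion could look like the following minimal sketch. It uses a tiny array for illustration (the same approach applies at N = 10,000,000) and assumes every entry is a valid numeric string, as in the question's data:

```python
import numpy as np

# Small illustrative arrays; the same approach applies at N = 10,000,000.
indices = np.array([['2', '7'], ['3', '9'], ['4', '5'], ['9', '1']])
costs = np.array([0.1, 0.2, 0.3, 0.4])

label_a = '2'
label_b = '9'

# One-time conversion: numeric strings -> integers.
num_indices = indices.astype(int)
a, b = int(label_a), int(label_b)

# Integer comparisons are cheaper than string comparisons.
cond = ((num_indices[:, 0] != a) & (num_indices[:, 0] != b) &
        (num_indices[:, 1] != a) & (num_indices[:, 1] != b))

print(costs[cond])    # only the cost of the ['4', '5'] row survives
print(indices[cond])
```

    The conversion itself costs time, so it only pays off when the filter runs repeatedly or the labels are long enough.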

    These are my results:

    0.9711 : run()
    0.7065 : run_faster()
    0.6983 : run_even_faster()
    0.2657 : run_sep_mask()
    0.4174 : run_sep_index()
    0.4536 : run_even_faster_numerical()
    

    The two sep entries show that, for run_even_faster, the indexing (about 0.42 s) takes roughly 1.5 times as long as building the mask (about 0.27 s), so you can only expect so much improvement from tuning the mask further.

    However, they also show that building the mask from integers adds less than 0.04 seconds on top of the actual indexing, compared to about 0.26 seconds for building the mask from strings. So, that's the room you have for improvement.
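    As a further side note (not benchmarked above), NumPy's np.isin can express the same four comparisons more compactly, and & is the idiomatic boolean operator where the code above uses *. A sketch, again with a tiny array for illustration:

```python
import numpy as np

indices = np.array([['2', '7'], ['3', '9'], ['4', '5'], ['9', '1']])
costs = np.array([0.1, 0.2, 0.3, 0.4])

labels = ['2', '9']

# True for rows where neither column matches any of the labels.
cond = ~np.isin(indices, labels).any(axis=1)

print(costs[cond])     # only the cost of the ['4', '5'] row survives
print(indices[cond])
```

    Whether this is faster than the explicit comparisons depends on the number of labels; for just two labels the explicit version is usually competitive, but np.isin scales more gracefully as the label set grows.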