I have a 2D array of 0s, 1s and 2s with a very large number of columns. I am trying to select only the rows whose runs of consecutive zeros do not exceed a certain length. My approach is to convert the array to characters, join the columns of each row into a string, and then apply a regular-expression filter to it. But this is very slow, especially the conversion and the joining of characters in each row. Is there a way to make it faster by an order of magnitude, maybe using another tactic altogether?
import re
import numpy as np

n = 100
k = 1000
x = np.random.choice([0, 1, 2], replace=True, size=(n, k))

# Cast to strings and join each row into one string of digits, e.g. [1, 0, 2] -> "102".
s = np.apply_along_axis(lambda t: ''.join(t), 1, x.astype(str))

N_ramp = 3
# A row's mask entry is True only when the pattern is not found in its string form.
mask = [re.search(r'[12]0{1,' + str(N_ramp) + r'}[12]', i) is None for i in s]
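For a sense of what the intermediate string form and the regex test do, here is a small hand-worked case (the row values are made up; N_ramp = 3 as above):

row = np.array([1, 0, 0, 2, 1])
print(''.join(row.astype(str)))                        # "10021"
print(re.search(r'[12]0{1,3}[12]', '10021') is None)   # False: the pattern matches "1002"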
Using this answer (the link is in the code below), you can get the counts of consecutive True values. You can apply it to your problem by turning your array into a boolean array that is True where the value is 0 and False otherwise. You then apply the linked algorithm to each row and check whether any of the resulting run lengths exceeds your threshold (the number of allowed consecutive zeros). I store these per-row results in a list; printing the sum shows how many rows meet the condition.
import numpy as np

n = 100
k = 1000
x = np.random.choice([0, 1, 2], replace=True, size=(n, k))

def get_consecutive_counts(arr):
    # Lengths of the runs of consecutive True values in a 1D boolean array.
    # https://stackoverflow.com/a/24343375/12131013
    return np.diff(np.where(np.concatenate(([arr[0]],
                                            arr[:-1] != arr[1:],
                                            [True])))[0])[::2]

def has_N_consecutive(arr, N):
    # True if any run of consecutive True values is longer than N.
    return np.any(get_consecutive_counts(arr) > N)

N_consecutive = 7
# x == 0 gives a boolean array; check each row for a run of zeros longer than N_consecutive.
res = [has_N_consecutive(row, N_consecutive) for row in x == 0]
print(sum(res))
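If you need the filtered rows themselves rather than just the count, res can be turned into a boolean mask; a minimal sketch, assuming you want to keep the rows whose zero runs never exceed N_consecutive:

keep = ~np.array(res)   # res[i] is True when row i has a run of more than N_consecutive zeros
filtered = x[keep]      # rows whose zero runs never exceed N_consecutive
print(filtered.shape)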