Python 3.9
I have a numpy ndarray of strings. The actual array has thousands of strings, but let's say:
words_master = ['CARES' 'BARES' 'CANES' 'TARES' 'PARES' 'BANES' 'BALES' 'CORES' 'BORES'
'MARES']
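For reference, here's a minimal way to build it (a sketch; assumes numpy is imported as np):
import numpy as np

words_master = np.array(['CARES', 'BARES', 'CANES', 'TARES', 'PARES',
                         'BANES', 'BALES', 'CORES', 'BORES', 'MARES'])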
I am trying to create a function that returns a list where the strings containing a given character have been deleted. This works, written as a while loop and an if statement:
index = 0
temp = []
while index != len(words_master):
    idx = words_master[index]
    if 'A' in idx:           # collect the indexes of words containing the character
        temp.append(index)
    index += 1
words_master = np.delete(words_master, temp)
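For the sample array above this leaves array(['CORES', 'BORES'], dtype='<U5'), which is the result I want.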
Since this is still an explicit loop and if statement, I'm wondering if it can be made more efficient using a list comprehension.
My best guess at this would be:
words_master = np.delete(words_master, np.argwhere([x for x, item in enumerate(words_master) if 'A' in item]))
Logic here is that np.delete will take the initial array and then delete all items at the indexes set by np.argwhere. However, it gives this output:
['CARES' 'BORES' 'MARES']
It appears that it ignores the first and last elements?
Other oddities: if I use 'CARES' in item, it returns the list without making any changes:
['CARES' 'BARES' 'CANES' 'TARES' 'PARES' 'BANES' 'BALES' 'CORES' 'BORES'
'MARES']
And if I use any other parameter ('MARES' or 'M' or 'O') it seems to return the full list without the first word:
['BARES' 'CANES' 'TARES' 'PARES' 'BANES' 'BALES' 'CORES' 'BORES' 'MARES']
I'm wondering if there is a clean way to fix that? Or is the while loop/if statement the best bet?
Edit: in response to "why not use a list?": I read that numpy arrays are a lot faster than Python lists, and when I tested this same loop using a Python list with remove(), it was 10x slower on a larger dataset.
The argwhere returns the indices of the nonzero elements of its argument. Applied to the index list built by the enumerate comprehension, the leading 0 counts as zero and gets dropped, and the surviving values are reinterpreted as positions. That's not what you want:
In [241]: [x for x, item in enumerate(words_master) if 'A' in item]
Out[241]: [0, 1, 2, 3, 4, 5, 6, 9]
In [242]: np.argwhere(_)
Out[242]:
array([[1],
[2],
[3],
[4],
[5],
[6],
[7]])
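That also explains the other oddities. With 'CARES' in item the index list is just [0], which argwhere treats as all-zero, so nothing is deleted; with 'M' it is [9], which is nonzero but sits at position 0, so argwhere returns [[0]] and delete drops the first word. A quick sanity check (a fresh session, not the numbered one above; reprs as printed on a typical 64-bit build):
>>> import numpy as np
>>> np.argwhere([0])       # the 'CARES' case: the only collected index is 0
array([], shape=(0, 1), dtype=int64)
>>> np.argwhere([9])       # the 'M' case: 9 is nonzero, but sits at position 0
array([[0]])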
Without the argwhere, the enumerate comprehension works just fine:
In [247]: np.delete(words_master, [x for x, item in enumerate(words_master) if
...: 'A' in item])
Out[247]: array(['CORES', 'BORES'], dtype='<U5')
But compare its time with a pure comprehension:
In [248]: timeit np.delete(words_master, [x for x, item in enumerate(words_master) if 'A' in item])
27.8 µs ± 930 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [249]: timeit [word for word in words_master if word.find('A')==-1]
1.73 µs ± 16.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [251]: timeit [word for word in words_master if 'A' not in word]
604 ns ± 2.86 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
The enumerate comprehension inside the delete times about the same as the other comprehensions, so most of the time in [248] is the delete itself. While it is an array function, it isn't super fast. It may scale better than the comprehensions, but we still haven't gotten rid of those:
In [252]: timeit [x for x, item in enumerate(words_master) if 'A' in item]
1.06 µs ± 4.04 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
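If the result needs to end up back in an array, wrapping the pure comprehension is still a reasonable route (a sketch; the np.array call adds overhead not included in the 604 ns above, and I haven't timed the combination):
np.array([word for word in words_master if 'A' not in word])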
If we start with an array of strings (instead of the list), it's faster to index it directly with a boolean mask rather than go through delete:
In [279]: arr = np.array(words_master)
In [280]: arr[['A' not in word for word in arr]]
Out[280]: array(['CORES', 'BORES'], dtype='<U5')
In [281]: timeit arr[['A' not in word for word in arr]]
12.9 µs ± 480 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
But we can improve on that by using both the array (for the indexing) and the list (for the iteration):
In [282]: timeit arr[['A' not in word for word in words_master]]
6.27 µs ± 245 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
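To stay entirely within numpy, np.char.find can build the mask in vectorized form (a sketch, not timed here; the np.char routines still apply Python string methods element by element, so don't expect numeric-ufunc speed):
mask = np.char.find(arr, 'A') == -1   # find returns -1 where 'A' is absent
arr[mask]                             # -> array(['CORES', 'BORES'], dtype='<U5')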