Tags: python, numpy, list-comprehension, numpy-ndarray

Errors with indexing when using numpy delete and enumerate


Python 3.9

I have a numpy ndarray of strings. The actual array has thousands of strings, but let's say:

words_master = np.array(['CARES', 'BARES', 'CANES', 'TARES', 'PARES',
                         'BANES', 'BALES', 'CORES', 'BORES', 'MARES'])

I am trying to create a function that returns a list where the strings containing a given character have been deleted. This works as a while loop and if statement:

    index = 0
    temp = []
    while index != len(words_master):
        word = words_master[index]
        if 'A' in word:
            temp.append(index)
        index += 1
    words_master = np.delete(words_master, temp)

Since this is still an explicit loop and if statement, I'm wondering if it can be made more efficient using a list comprehension.

My best guess at this would be:

words_master = np.delete(words_master, np.argwhere([x for x, item in enumerate(words_master) if 'A' in item]))

Logic here is that np.delete will take the initial array and then delete all items at the indexes set by np.argwhere. However, it gives this output:

['CARES' 'BORES' 'MARES']

It appears that it ignores the first and last elements?

Other oddities: if I use 'CARES' in item, it returns the list without making any changes:

['CARES' 'BARES' 'CANES' 'TARES' 'PARES' 'BANES' 'BALES' 'CORES' 'BORES'
 'MARES']

And if I use any other parameter ('MARES' or 'M' or 'O') it seems to return the full list without the first word:

['BARES' 'CANES' 'TARES' 'PARES' 'BANES' 'BALES' 'CORES' 'BORES' 'MARES']

I tried:

  • Playing around with the indices, for instance wrapping the enumeration in reversed(list(enumerate(…))) or shifting the index list by -1. These produce the same kinds of patterns, just displaced.
  • Using np.where() instead, but with similar problems.

I'm wondering if there is a clean way to fix that? Or is the while loop/if statement the best bet?

Edit: to the question "why not use list", I read that numpy arrays are a lot faster than python lists, and when I tested this same for-loop except using a python list with the remove() function, it was 10x slower on a larger dataset.
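A rough sketch of the comparison I ran (random placeholder data rather than my real dataset; the sizes and helper names here are illustrative, and exact timings will vary by machine):

```python
import numpy as np
from timeit import timeit

# Hypothetical "larger dataset": 2,000 random five-letter words
rng = np.random.default_rng(0)
letters = list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
words = [''.join(rng.choice(letters, 5)) for _ in range(2_000)]
arr = np.array(words)

def with_list():
    temp = list(words)
    for w in words:
        if 'A' in w:
            temp.remove(w)   # remove() rescans the list on every call
    return temp

def with_numpy():
    return np.delete(arr, [i for i, w in enumerate(arr) if 'A' in w])

# Both approaches keep exactly the words without an 'A'
assert sorted(with_list()) == sorted(with_numpy().tolist())
print(timeit(with_list, number=3), timeit(with_numpy, number=3))
```

The list version is quadratic because each remove() scans the list from the start, which is where the slowdown I saw comes from.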


Solution

  • np.argwhere returns the indices of the nonzero elements of its input. Here the input is your list of matching indices, so it reports positions within that list where the value is nonzero: the leading 0 is falsy and gets dropped, and the returned values are positions in the list, not positions in words_master. That's not what you want.

    In [241]: [x for x, item in enumerate(words_master) if 'A' in item]
    Out[241]: [0, 1, 2, 3, 4, 5, 6, 9]
    In [242]: np.argwhere(_)
    Out[242]: 
    array([[1],
           [2],
           [3],
           [4],
           [5],
           [6],
           [7]])
    
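    A minimal script (using the ten-word array from the question) makes the failure concrete, and reproduces the puzzling ['CARES' 'BORES' 'MARES'] output:

    ```python
    import numpy as np

    words_master = np.array(['CARES', 'BARES', 'CANES', 'TARES', 'PARES',
                             'BANES', 'BALES', 'CORES', 'BORES', 'MARES'])

    # Indices of words containing 'A':
    idx = [x for x, item in enumerate(words_master) if 'A' in item]
    print(idx)                            # [0, 1, 2, 3, 4, 5, 6, 9]

    # argwhere asks "where is this list nonzero?" -- the leading 0 is falsy,
    # and the answers are positions *within idx*, not within words_master:
    bad = np.argwhere(idx).ravel()
    print(bad)                            # [1 2 3 4 5 6 7]

    # Deleting positions 1..7 keeps elements 0, 8, and 9:
    print(np.delete(words_master, bad))   # ['CARES' 'BORES' 'MARES']
    ```

    So the first and last elements weren't "ignored": the deletion simply targeted the wrong positions.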

    Without argwhere, the enumerate comprehension works just fine:

    In [247]: np.delete(words_master, [x for x, item in enumerate(words_master) if
         ...: 'A' in item])
    Out[247]: array(['CORES', 'BORES'], dtype='<U5')
    

    But compare its time with a pure comprehension:

    In [248]: timeit np.delete(words_master, [x for x, item in enumerate(words_master) if 'A' in item])
    27.8 µs ± 930 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    In [249]: timeit [word for word in words_master if word.find('A')==-1]
    1.73 µs ± 16.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
    In [251]: timeit [word for word in words_master if 'A' not in word]
    604 ns ± 2.86 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
    

    The enumerate comprehension alone times about the same as the other comprehensions, so most of the time in [248] is np.delete itself. Although it's an array function, it isn't especially fast. It may scale better than the comprehensions, but we still haven't gotten rid of them.

    In [252]: timeit [x for x, item in enumerate(words_master) if 'A' in item]
    1.06 µs ± 4.04 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
    

    If we start with an array of strings (instead of the list), it's faster to index it directly with a boolean mask, rather than go through delete:

    In [279]: arr = np.array(words_master)
    In [280]: arr[['A' not in word for word in arr]]
    Out[280]: array(['CORES', 'BORES'], dtype='<U5')
    In [281]: timeit arr[['A' not in word for word in arr]]
    12.9 µs ± 480 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
    

    But we can improve on it by combining the two: index the array, while iterating over the list:

    In [282]: timeit arr[['A' not in word for word in words_master]]
    6.27 µs ± 245 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
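    For completeness, numpy's vectorized string routines can build the same boolean mask without any Python-level loop. A sketch using np.char.find, which returns -1 elementwise where the substring is absent:

    ```python
    import numpy as np

    arr = np.array(['CARES', 'BARES', 'CANES', 'TARES', 'PARES',
                    'BANES', 'BALES', 'CORES', 'BORES', 'MARES'])

    # np.char.find applies str.find elementwise; -1 means 'A' was not found
    mask = np.char.find(arr, 'A') == -1
    print(arr[mask])        # ['CORES' 'BORES']
    ```

    Whether this beats the comprehension-built mask depends on the array size; for a handful of words the pure-Python comprehensions above tend to win, but the np.char route avoids iterating in Python entirely.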