Tags: python, numpy, list-comprehension, numpy-ndarray

Errors with indexing when using numpy delete and enumerate


Python 3.9

I have a numpy ndarray of strings. The actual array has thousands of strings, but let's say:

words_master = np.array(['CARES', 'BARES', 'CANES', 'TARES', 'PARES',
                         'BANES', 'BALES', 'CORES', 'BORES', 'MARES'])

I am trying to create a function that returns a list where the strings containing a given character have been deleted. This works as a while loop and if statement:

    index = 0
    temp = []
    while index != len(words_master):
        word = words_master[index]
        if 'A' in word:
            temp.append(index)
        index += 1
    words_master = np.delete(words_master, temp)

Since this is still an explicit loop and if statement, I'm wondering if it can be made more efficient using a list comprehension.

My best guess at this would be:

words_master = np.delete(words_master, np.argwhere([x for x, item in enumerate(words_master) if 'A' in item]))

Logic here is that np.delete will take the initial array and then delete all items at the indexes set by np.argwhere. However, it gives this output:

['CARES' 'BORES' 'MARES']

It appears that it ignores the first and last elements?

Other oddities: if I use 'CARES' in item, it returns the list without making any changes:

['CARES' 'BARES' 'CANES' 'TARES' 'PARES' 'BANES' 'BALES' 'CORES' 'BORES'
 'MARES']

And if I use any other parameter ('MARES' or 'M' or 'O') it seems to return the full list without the first word:

['BARES' 'CANES' 'TARES' 'PARES' 'BANES' 'BALES' 'CORES' 'BORES' 'MARES']

I tried:

  • Playing around with the indices, for instance wrapping the enumeration in reversed(list(enumerate(…))) or shifting the index list by -1. These produce the same kinds of patterns, just displaced.
  • Using np.where() instead, but with similar problems.

I'm wondering if there is a clean way to fix that? Or is the while loop/if statement the best bet?

Edit: to the question "why not use list", I read that numpy arrays are a lot faster than python lists, and when I tested this same for-loop except using a python list with the remove() function, it was 10x slower on a larger dataset.
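A rough sketch of the comparison I ran (random placeholder data rather than my real dataset; the sizes and helper names here are illustrative, and exact timings will vary by machine):

```python
import numpy as np
from timeit import timeit

# Hypothetical "larger dataset": 2,000 random five-letter words
rng = np.random.default_rng(0)
letters = list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
words = [''.join(rng.choice(letters, 5)) for _ in range(2_000)]
arr = np.array(words)

def with_list():
    temp = list(words)
    for w in words:
        if 'A' in w:
            temp.remove(w)   # remove() rescans the list on every call
    return temp

def with_numpy():
    return np.delete(arr, [i for i, w in enumerate(arr) if 'A' in w])

# Both approaches keep exactly the words without an 'A'
assert sorted(with_list()) == sorted(with_numpy().tolist())
print(timeit(with_list, number=3), timeit(with_numpy, number=3))
```

The list version is quadratic because each remove() scans the list from the start, which is where the slowdown I saw comes from.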


Solution

  • np.argwhere returns the indices of the nonzero elements of its input. Here the input is your list of matching indices, so it reports positions within that list where the value is nonzero: the leading 0 is falsy and gets dropped, and the returned values are positions in the list, not positions in words_master. That's not what you want.

    In [241]: [x for x, item in enumerate(words_master) if 'A' in item]
    Out[241]: [0, 1, 2, 3, 4, 5, 6, 9]
    In [242]: np.argwhere(_)
    Out[242]: 
    array([[1],
           [2],
           [3],
           [4],
           [5],
           [6],
           [7]])
    
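    A minimal script (using the ten-word array from the question) makes the failure concrete, and reproduces the puzzling ['CARES' 'BORES' 'MARES'] output:

    ```python
    import numpy as np

    words_master = np.array(['CARES', 'BARES', 'CANES', 'TARES', 'PARES',
                             'BANES', 'BALES', 'CORES', 'BORES', 'MARES'])

    # Indices of words containing 'A':
    idx = [x for x, item in enumerate(words_master) if 'A' in item]
    print(idx)                            # [0, 1, 2, 3, 4, 5, 6, 9]

    # argwhere asks "where is this list nonzero?" -- the leading 0 is falsy,
    # and the answers are positions *within idx*, not within words_master:
    bad = np.argwhere(idx).ravel()
    print(bad)                            # [1 2 3 4 5 6 7]

    # Deleting positions 1..7 keeps elements 0, 8, and 9:
    print(np.delete(words_master, bad))   # ['CARES' 'BORES' 'MARES']
    ```

    So the first and last elements weren't "ignored": the deletion simply targeted the wrong positions.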

    Without argwhere, the enumerate comprehension works just fine:

    In [247]: np.delete(words_master, [x for x, item in enumerate(words_master) if
         ...: 'A' in item])
    Out[247]: array(['CORES', 'BORES'], dtype='<U5')
    

    But compare its time with a pure comprehension:

    In [248]: timeit np.delete(words_master, [x for x, item in enumerate(words_master) if 'A' in item])
    27.8 µs ± 930 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    In [249]: timeit [word for word in words_master if word.find('A')==-1]
    1.73 µs ± 16.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
    In [251]: timeit [word for word in words_master if 'A' not in word]
    604 ns ± 2.86 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
    

    The enumerate comprehension alone times about the same as the other comprehensions, so most of the time in [248] is np.delete itself. Although it's an array function, it isn't especially fast. It may scale better than the comprehensions, but we still haven't gotten rid of them.

    In [252]: timeit [x for x, item in enumerate(words_master) if 'A' in item]
    1.06 µs ± 4.04 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
    

    If we start with an array of strings (instead of the list), it's faster to index it directly with a boolean mask, rather than go through delete:

    In [279]: arr = np.array(words_master)
    In [280]: arr[['A' not in word for word in arr]]
    Out[280]: array(['CORES', 'BORES'], dtype='<U5')
    In [281]: timeit arr[['A' not in word for word in arr]]
    12.9 µs ± 480 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
    

    But we can improve on it by combining the two: index the array, while iterating over the list:

    In [282]: timeit arr[['A' not in word for word in words_master]]
    6.27 µs ± 245 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
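    For completeness, numpy's vectorized string routines can build the same boolean mask without any Python-level loop. A sketch using np.char.find, which returns -1 elementwise where the substring is absent:

    ```python
    import numpy as np

    arr = np.array(['CARES', 'BARES', 'CANES', 'TARES', 'PARES',
                    'BANES', 'BALES', 'CORES', 'BORES', 'MARES'])

    # np.char.find applies str.find elementwise; -1 means 'A' was not found
    mask = np.char.find(arr, 'A') == -1
    print(arr[mask])        # ['CORES' 'BORES']
    ```

    Whether this beats the comprehension-built mask depends on the array size; for a handful of words the pure-Python comprehensions above tend to win, but the np.char route avoids iterating in Python entirely.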