python, list, numpy

NumPy: Duplicate mask for an array (returning True if we've seen that value before, False otherwise)


I'm looking for a vectorized function that returns a mask with values of True if the value in the array has been seen before and False otherwise.

I'm looking for the fastest solution possible as speed is very important.

For example this is what I would like to see:

array = [1, 2, 1, 2, 3]
mask = [False, False, True, True, False]

So is_duplicate = array[mask] should return [1, 2].

Is there a fast, vectorized way to do this? Thanks!


Solution

  • Approach #1 : With sorting

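    import numpy as np

    def mask_firstocc(a):
        # Stable sort keeps equal values in their original relative order
        sidx = a.argsort(kind='stable')
        b = a[sidx]
        # True where a sorted value equals its predecessor; then undo the sort
        out = np.r_[False, b[:-1] == b[1:]][sidx.argsort()]
        return out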
    

    We can avoid the second argsort by writing the mask back into place with array-assignment, which boosts performance further -

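    def mask_firstocc_v2(a):
        sidx = a.argsort(kind='stable')
        b = a[sidx]
        # True where a sorted value equals its predecessor
        mask = np.r_[False, b[:-1] == b[1:]]
        out = np.empty(len(a), dtype=bool)
        # Scatter the mask back to the original positions
        out[sidx] = mask
        return out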
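
As a quick check on the array-assignment step, here is a minimal sketch showing that scattering with `out[sidx] = mask` gives the same result as indexing with `sidx.argsort()`. It uses the sample array from this answer; the variable names `gathered` and `scattered` are illustrative and not part of the original code -

```python
import numpy as np

a = np.array([2, 1, 1, 0, 0, 4, 0, 3])

# Stable sort keeps equal values in their original relative order
sidx = a.argsort(kind='stable')
b = a[sidx]

# Mask in sorted order: True where a value equals its predecessor
mask = np.r_[False, b[:-1] == b[1:]]

# Gather form: a second argsort builds the inverse permutation
gathered = mask[sidx.argsort()]

# Scatter form: write each mask entry straight to its original position
scattered = np.empty(len(a), dtype=bool)
scattered[sidx] = mask

print(np.array_equal(gathered, scattered))  # True
```

The scatter form skips the second sort, which is where the extra speed comes from.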
    

    Sample run -

    In [166]: a
    Out[166]: array([2, 1, 1, 0, 0, 4, 0, 3])
    
    In [167]: mask_firstocc(a)
    Out[167]: array([False, False,  True, False,  True, False,  True, False])
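
To gain some confidence beyond a single sample run, the sorting approach can be compared against a plain Python loop on random data. This is only a sketch: `mask_reference` is a hypothetical helper written for this check, and the array size and value range are arbitrary assumptions -

```python
import numpy as np

def mask_firstocc_v2(a):
    sidx = a.argsort(kind='stable')
    b = a[sidx]
    mask = np.r_[False, b[:-1] == b[1:]]
    out = np.empty(len(a), dtype=bool)
    out[sidx] = mask
    return out

def mask_reference(a):
    # Hypothetical slow reference: track seen values in a set
    seen = set()
    out = []
    for x in a.tolist():
        out.append(x in seen)
        seen.add(x)
    return np.array(out, dtype=bool)

rng = np.random.default_rng(0)
a = rng.integers(0, 50, size=1000)  # many repeats on purpose

agree = np.array_equal(mask_firstocc_v2(a), mask_reference(a))
print(agree)
```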
    

  • Approach #2 : With np.unique(..., return_index=True)

    We can leverage np.unique with return_index=True, which returns the index of the first occurrence of each unique element. Setting those positions to False in an all-True array leaves True only at the repeats -

    def mask_firstocc_with_unique(a):
        mask = np.ones(len(a), dtype=bool)
        mask[np.unique(a, return_index=True)[1]] = False
        return mask
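
Applied to the array from the question, the np.unique approach can be sketched as below. The timing part at the end is only a rough harness under assumed conditions (array size, value range and repeat count are arbitrary choices), so no timings are claimed here; results will depend on the data and the machine -

```python
import timeit
import numpy as np

def mask_firstocc_v2(a):
    sidx = a.argsort(kind='stable')
    b = a[sidx]
    mask = np.r_[False, b[:-1] == b[1:]]
    out = np.empty(len(a), dtype=bool)
    out[sidx] = mask
    return out

def mask_firstocc_with_unique(a):
    mask = np.ones(len(a), dtype=bool)
    mask[np.unique(a, return_index=True)[1]] = False
    return mask

a = np.array([1, 2, 1, 2, 3])

# Indices of the first occurrence of each unique value
first_idx = np.unique(a, return_index=True)[1]
print(first_idx.tolist())   # [0, 1, 4]

mask = mask_firstocc_with_unique(a)
print(mask.tolist())        # [False, False, True, True, False]
print(a[mask].tolist())     # [1, 2]

# Rough timing harness on a larger random array (assumed size and range)
rng = np.random.default_rng(0)
big = rng.integers(0, 1000, size=100_000)
for fn in (mask_firstocc_v2, mask_firstocc_with_unique):
    t = timeit.timeit(lambda: fn(big), number=20)
    print(fn.__name__, t)
```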