Tags: python, numpy, grouping, unique, sequence

Identifying the first element that belongs to different sequences in an array


I have an array (A) of sorted integers which contains ascending sequences with gaps.

A = array([1,2,3,4, 7,8,9, 23,24,25, 100])

I have an array (B) that contains a few values selected from A through an external process.

B = array([1,2,23,25,100])

I want to filter B so that, for each sequence in A, only the first value of B that belongs to that sequence is kept:

C = array([1,23,100])

I have managed to do it by creating a second list to keep track of what has already been appended, but it seems kind of clumsy. I'm wondering if there is a better way to do this?

import numpy as np

A = np.array([1,2,3,4, 7,8,9, 23,24,25, 100])
B = np.array([1,2,23,25,100])
C = []

already_used_sequence = []

for i, value in enumerate(A):
    # index minus value is constant within a run of consecutive integers
    key = i - value
    if key in already_used_sequence:  # did we already group this sequence?
        continue
    if value in B:  # is this value in B?
        C.append(value)
        already_used_sequence.append(key)
C=np.array(C)

Solution

  • You need a groupby operation, which is not easily done with numpy alone. One option is to use pandas:

    import pandas as pd
    
    # convert to pandas Series
    s = pd.Series(A)
    
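    # label runs of consecutive values: a new group starts wherever the gap to the previous value exceeds 1
    # among the values of A that also appear in B, keep the first one found per group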
    out = s[np.isin(A, B)].groupby(s.diff().gt(1).cumsum()).first().to_numpy()
    

    Output: array([ 1, 23, 100])
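
If adding pandas is not an option, the same idea (label each run of consecutive values, then keep the first selected value per label) can be sketched in NumPy alone. This is only an illustrative alternative, not part of the answer above; it assumes A is sorted with no duplicates, and the names run_id, mask, selected and first_idx are made up for this sketch.

```python
import numpy as np

A = np.array([1, 2, 3, 4, 7, 8, 9, 23, 24, 25, 100])
B = np.array([1, 2, 23, 25, 100])

# Label each run of consecutive integers in A:
# the label increases by one wherever the gap to the previous value exceeds 1.
run_id = np.concatenate(([0], np.cumsum(np.diff(A) > 1)))

# Keep only the elements of A that also appear in B, together with their run labels.
mask = np.isin(A, B)
selected = A[mask]
selected_runs = run_id[mask]

# np.unique with return_index=True gives the index of the first occurrence of each label.
_, first_idx = np.unique(selected_runs, return_index=True)
C = selected[first_idx]

print(C.tolist())  # [1, 23, 100]
```

Because the labels come from np.diff on A itself, values of B that fall in the same run of A always share a label, even when B skips values inside that run (such as 23 and 25 here).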