Search code examples
pythonpandasloopswhile-loop

Iterate over numbers in a column and get the first row number where the n amount of repetition starts


In my dataset I need to find when there are number of 0's repeated more than 280 times and get back the first row number of the start of the repetition in an array.I'm using python 3.11.

Sample data:

differences

0
0
0
0
0
0
0
.
.
.
5
5
.
.
0
0
0
0

or to create a sample dataset:

   ACD=[0,5]

   df2 = pd.DataFrame(np.repeat(ACD, 100, axis=0))
   df3=df2.sample(frac=1,axis=1).sample(frac=1).reset_index(drop=True)

So far my code:

c=[]
for values,row in df.loc[:, ['differences']].iterrows():
        i=0
        while row['differences']  == 0:
            count = sum(1 for i in row)
            i +=1
            if count > 280:
                continue
            c.append(np.where(row['differences']))
        else:
            values+=1

Expected output:

row_number_rep= [5,90,120] #showing the specific row numbers where the repetition stars.

With this code I get the error

<stdin>:8: DeprecationWarning: Calling nonzero on 0d arrays is deprecated, as it behaves surprisingly. Use atleast_1d(arr).nonzero() if the old behavior was intended.

I need help to improve this code. I think the problem is that I do not have first 280 items as 0's and I need to keep iterating over the whole column to find all the row number of beginning of 0's repeated 280 times.


Solution

  • pandas approach

    Assuming this example and a threshold of 4 (instead of 280):

    df = pd.DataFrame({'differences': [0,0,0,0,0,1,2,0,3,0,0,0,0,0,0,4,0,5]})
    
        differences
    0             0  # 0: first stretch of >4
    1             0
    2             0
    3             0
    4             0
    5             1
    6             2
    7             0
    8             3
    9             0  # 9: second stretch of >4
    10            0
    11            0
    12            0
    13            0
    14            0
    15            4
    16            0
    17            5
    

    You could perform a groupby.size to filter the groupby.first:

    thresh = 4
    
    m = df['differences'].eq(0)
    group = (~m).cumsum().to_numpy()
    
    g = df.reset_index()[m].groupby(group[m])
    g.size()
    
    out = g['index'].first()[g.size()>thresh].to_numpy()
    

    Output: array([ 0, 9])

    loop approach
    lst = [0, 0, 0, 0, 0, 1, 2, 0, 3, 0, 0, 0, 0, 0, 0, 4, 0, 5, 0, 0, 0, 0, 0]
    thresh = 4
    
    start = -1
    zeros = False
    count = 0
    out = []
    for i, v in enumerate(lst+[-1]):
        if v==0:
            if not zeros:
                count = 0
                start = i
                zeros = True
            count += 1
            continue
        if count > thresh:
            if zeros:
                out.append(start)
        zeros = False
    
    out
    # [0, 9, 18]
    
    itertools.groupby approach:
    from itertools import groupby
    
    lst = [0, 0, 0, 0, 0, 1, 2, 0, 3, 0, 0, 0, 0, 0, 0, 4, 0, 5, 0, 0, 0, 0, 0]
    thesh  = 4
    
    out = [x[0][0] for k,g in groupby(enumerate(lst), key=lambda x: x[1]==0)
           if k and len(x:=list(g))>thresh]
    # [0, 9, 18]