python, pandas, for-loop, while-loop, generator

Updating pandas index and row in nested while loop


I am having trouble iteratively updating the row and index in a for loop that uses the DataFrame.iterrows() generator. In the example below, my objective is to get the distance between each subsequent letter and the first letter (A) at index 0:

import pandas as pd
import string
 
data = {'letter':['A', 'C', 'D', 'X', 'X', 'Z', 'A', 'E', 'Z', 'Y', 'D', 'B', 'A']}
df = pd.DataFrame(data)

# alphabet position of the first letter (A)
first_pos = string.ascii_uppercase.index(df['letter'].iloc[0])

for index, row in df.iloc[1:].iterrows():

    # distance between the current letter and the first letter
    dist = abs(first_pos - string.ascii_uppercase.index(row['letter']))

    print(dist)

Output

2
3
23
23
25
0
4
25
24
3
1
0
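
For reference, the same distances can also be computed without an explicit loop. This is only a minimal sketch, assuming the default integer index and uppercase ASCII letters; the names `positions` and `dist` are illustrative and not part of the original question.

```python
import string
import pandas as pd

df = pd.DataFrame({'letter': ['A', 'C', 'D', 'X', 'X', 'Z', 'A', 'E', 'Z', 'Y', 'D', 'B', 'A']})

# Map each letter to its 0-based position in the alphabet.
positions = df['letter'].map(string.ascii_uppercase.index)

# Absolute distance of every later letter from the first letter.
dist = (positions - positions.iloc[0]).abs().iloc[1:]

print(dist.tolist())  # [2, 3, 23, 23, 25, 0, 4, 25, 24, 3, 1, 0]
```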

That is easy enough. However, when the distance exceeds 5, I would like to compare the following letters to the last "normal" letter (the last one whose distance from its reference letter was <= 5) using a while loop, and append the indexes of the letters that deviate. For example:

import pandas as pd
import string
 
data = {'letter':['A', 'C', 'D', 'X', 'X', 'Z', 'A', 'E', 'Z', 'Y', 'D', 'B', 'A']}
df = pd.DataFrame(data)
bad_letters = []
    

start_idx=df['letter'].iloc[0]
compare_letter = string.ascii_uppercase.index(df['letter'].iloc[0])

for index, row in df.iloc[1:].iterrows():
                                              
    dist= abs(compare_letter-string.ascii_uppercase.index(row['letter']))  
    
    if dist > 5:
                                              
        compare_letter =  string.ascii_uppercase.index(df['letter'][index-1]) #reset compare letter
        abnormal=True

        while abnormal:
            
            bad_letters.append(index)
            dist=abs(compare_letter-string.ascii_uppercase.index(df['letter'][index]))
            index+=1 #increment index
            
            if dist <=5:
                abnormal=False
                compare_letter=string.ascii_uppercase.index(df['letter'][index])
                #?update iterrows index with this index#
                break
            
        else:
            continue
       

The output list bad_letters should be [3, 4, 5, 8, 9], which corresponds to:

- the indexes of the letters X, X, Z, which deviate by more than 5 from D at index 2
- the indexes of the letters Z, Y, which deviate by more than 5 from E at index 7

The above attempt fails, and I am not sure how to structure this properly in a way that effectively uses iterrows() with a while loop. How would one use a while loop inside of iterrows() or a different pandas dataframe generator to answer this basic question? How can one iteratively "update" the index and row of the original for loop with the index that breaks the nested while loop? Any advice would be appreciated.


Solution

  • You can combine a threshold comparison with cumsum here: each letter whose distance from A is within the threshold starts a new group, and the consecutive letters after it that exceed the threshold fall into the same group. Based on that, you only need the groups with more than one row. Within each of those groups, compare every value to the first value in the group and output the rows that are still more than 5 away.

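    import pandas as pd

    data = pd.DataFrame({'letter':['A', 'C', 'D', 'X', 'X', 'Z', 'A', 'E', 'Z', 'Y', 'D', 'B', 'A']})

    # distance of each letter from 'A'
    data['dist'] = data.letter.apply(ord) - ord('A')
    # a letter within the threshold (<= 5, as in the question) starts a new group
    data['group'] = data.dist.le(5).cumsum()

    # keep only groups that also contain letters beyond the threshold
    data = data.groupby('group').filter(lambda x: len(x) > 1)
    # within each group, flag rows more than 5 away from the group's first letter
    data = data.groupby('group').apply(lambda x: (x['dist'] - x['dist'].iloc[0]) > 5).reset_index()

    # 'level_1' holds the original row index
    data.loc[data['dist'], 'level_1'].values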
    

    Output

    array([3, 4, 5, 8, 9], dtype=int64)
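
    If an explicit loop is preferred, the rule from the question can also be sketched in plain Python. This is not the answer's method but an illustrative alternative. The helper names `position`, `bad_single_pass`, and `bad_shared_iterator` are hypothetical, and the sketch assumes uppercase ASCII letters and a default integer index, so positions and index labels coincide. The first function tracks the last normal letter in a single pass. The second shows one way an inner while loop can advance the outer loop: both loops pull from the same iterator.

```python
import string

THRESHOLD = 5  # maximum allowed distance, taken from the question


def position(letter):
    """Return the 0-based alphabet position of an uppercase letter."""
    return string.ascii_uppercase.index(letter)


def bad_single_pass(letters, threshold=THRESHOLD):
    """Flag positions more than `threshold` away from the last normal letter."""
    letters = list(letters)
    bad = []
    reference = position(letters[0])
    for idx, letter in enumerate(letters[1:], start=1):
        if abs(position(letter) - reference) > threshold:
            bad.append(idx)               # deviating letter: keep the reference
        else:
            reference = position(letter)  # normal letter becomes the new reference
    return bad


def bad_shared_iterator(letters, threshold=THRESHOLD):
    """Same rule, with an inner while loop that advances the outer loop."""
    rows = enumerate(letters)             # one iterator shared by both loops
    _, first = next(rows)
    reference = position(first)
    bad = []
    for idx, letter in rows:
        while abs(position(letter) - reference) > threshold:
            bad.append(idx)
            step = next(rows, None)       # consume the next row from the same iterator
            if step is None:              # ran out of rows while still abnormal
                return bad
            idx, letter = step
        reference = position(letter)      # the row that ended the run is normal
    return bad


letters = ['A', 'C', 'D', 'X', 'X', 'Z', 'A', 'E', 'Z', 'Y', 'D', 'B', 'A']
print(bad_single_pass(letters))      # [3, 4, 5, 8, 9]
print(bad_shared_iterator(letters))  # [3, 4, 5, 8, 9]
```

    Both functions accept any iterable of letters, so a column such as df['letter'] can be passed directly. Because the for loop and the while loop share one iterator in the second function, the outer loop resumes right after the row that ended the abnormal run, with no need to reset its index manually.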