I am facing the challenge of iteratively updating row and index in a for loop that uses the df.iterrows() generator. In the example below, my objective is to get the distance between each subsequent letter and the first letter (A) at index 0:
import pandas as pd
import string
data = {'letter':['A', 'C', 'D', 'X', 'X', 'Z', 'A', 'E', 'Z', 'Y', 'D', 'B', 'A']}
df = pd.DataFrame(data)
# alphabet position of the first letter, used as the fixed reference
first_pos = string.ascii_uppercase.index(df['letter'].iloc[0])
for index, row in df.iloc[1:].iterrows():
    dist = abs(first_pos - string.ascii_uppercase.index(row['letter']))
    print(dist)
2
3
23
23
25
0
4
25
24
3
1
0
That is easy enough. However, if the distance exceeds 5, I would like to switch the reference to the last "normal" letter (one whose distance from the previous reference was <= 5), compare the following letters to it in a while loop, and append the indexes of the letters that deviate. For example:
import pandas as pd
import string
data = {'letter':['A', 'C', 'D', 'X', 'X', 'Z', 'A', 'E', 'Z', 'Y', 'D', 'B', 'A']}
df = pd.DataFrame(data)
bad_letters = []
start_idx = df['letter'].iloc[0]
compare_letter = string.ascii_uppercase.index(df['letter'].iloc[0])
for index, row in df.iloc[1:].iterrows():
    dist = abs(compare_letter - string.ascii_uppercase.index(row['letter']))
    if dist > 5:
        compare_letter = string.ascii_uppercase.index(df['letter'][index-1])  # reset compare letter
        abnormal = True
        while abnormal:
            bad_letters.append(index)
            dist = abs(compare_letter - string.ascii_uppercase.index(df['letter'][index]))
            index += 1  # increment index
            if dist <= 5:
                abnormal = False
                compare_letter = string.ascii_uppercase.index(df['letter'][index])
                # ? update the iterrows index with this index
                break
            else:
                continue
The output list bad_letters should be [3, 4, 5, 8, 9], which corresponds to:
- the indexes of the letters X, X, Z, which deviate by more than 5 from D at index 2
- the indexes of the letters Z, Y, which deviate by more than 5 from E at index 7
The above attempt fails, and I am not sure how to structure this properly in a way that effectively uses iterrows() with a while loop. How would one use a while loop inside of iterrows(), or a different pandas dataframe generator, to answer this basic question? How can one iteratively "update" the index and row of the original for loop with the index that breaks the nested while loop? Any advice would be appreciated.
You can leverage lt (less than) and cumsum here to put each run of consecutive values exceeding the threshold into the same group as the prior value that doesn't exceed it. Based on that, you only need the groups with more than one row; within those, compare each value to the first value in its group and output the ones that are still too far away.
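import pandas as pd
data = pd.DataFrame({'letter':['A', 'C', 'D', 'X', 'X', 'Z', 'A', 'E', 'Z', 'Y', 'D', 'B', 'A']})
# distance of every letter from 'A'
data['dist'] = data.letter.apply(ord) - ord('A')
# each "normal" row (dist < 5) starts a new group; rows that are too far away stay in the previous group
data['group'] = data.dist.lt(5).cumsum()
# keep only groups that have rows following the normal one
data = data.groupby('group').filter(lambda x: len(x) > 1)
# within each group, flag rows more than 5 away from the group's first letter
data = data.groupby('group').apply(lambda x: (x['dist'] - x['dist'].iloc[0]) > 5).reset_index()
# level_1 holds the original row index
data.loc[data['dist'], 'level_1'].values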
Output
array([3, 4, 5, 8, 9], dtype=int64)
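If you would rather keep the explicit loop from the question: reassigning index inside a for loop over iterrows() has no effect on the generator, so the next iteration simply yields the next row. One way around this is a plain while loop that advances a position counter itself. Below is a minimal sketch, assuming a default 0..n-1 index and uppercase ASCII letters; the function name find_bad_letters and its threshold parameter are hypothetical names chosen for illustration, and the rule it follows is my reading of the one described in the question.

```python
import string


def find_bad_letters(letters, threshold=5):
    """Return positions of letters that deviate more than `threshold`
    from the last "normal" letter (hypothetical helper, not a pandas API)."""
    pos = [string.ascii_uppercase.index(ch) for ch in letters]
    bad = []
    ref = pos[0]  # start by comparing against the first letter
    i = 1
    while i < len(pos):
        if abs(pos[i] - ref) > threshold:
            ref = pos[i - 1]  # reset the reference to the last normal letter
            # consume the whole abnormal run, advancing i manually
            while i < len(pos) and abs(pos[i] - ref) > threshold:
                bad.append(i)
                i += 1
            if i < len(pos):
                ref = pos[i]  # the letter that ended the run becomes the reference
        i += 1
    return bad


# with a DataFrame this would be df['letter'].tolist()
letters = ['A', 'C', 'D', 'X', 'X', 'Z', 'A', 'E', 'Z', 'Y', 'D', 'B', 'A']
print(find_bad_letters(letters))  # [3, 4, 5, 8, 9]
```

Because the counter i is owned by the loop rather than by a generator, the inner while loop can skip ahead freely, which is the "update the index" behaviour the question asks for.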