Search code examples
pythonpandasdataframeexport-to-excel

How to flag an anomaly in a data frame (row wise)?


Python newbie here, I will like to flag sporadic numbers that are obviously off from the rest of the row. In simple terms, flag numbers that seem not to belong to each row. Numbers in 100s and 100000s are considered 'off the rest'

import pandas as pd
  
# intialise data of lists.
data = {'A':['R1', 'R2', 'R3', 'R4', 'R5'],
         'B':[12005, 18190, 1021, 13301, 31119,],
        'C':[11021, 19112, 19021,15, 24509 ],
        'D':[10022,19910, 19113,449999, 25519],
        'E':[14029, 29100, 39022, 24509, 412271],
        'F':[52119,32991,52883,69359,57835],
         'G':[41218, 52991,1021,69152,79355],
         'H': [43211,7672991,56881,211,77342],
          'J': [31211,42901,53818,62158,69325],
        }
  
# Create DataFrame
df = pd.DataFrame(data)
  
# Print the output.
df.describe()

I am trying to do something exactly like this

enter image description here

# I need help with step 1

#my code/pseudocode

# step 1:  identify the values in each row that are don't belong to the group


# step 2:  flag the identified values and export to excel

style_df = .applymap(lambda x: "background-color: yellow" if x else "") # flags the values that meets the criteria 

with pd.ExcelWriter("flagged_data.xlsx", engine="openpyxl") as writer:
    df.style.apply(lambda x: style_df, axis=None).to_excel(writer,index=False)



Solution

  • If you don't need to use machine learning outliers detection or Hampel filter and you already know the limits of your filter, you can simply do

    def higlight_outliers(s):
        # force to numeric and coerce string to NaN
        s = pd.to_numeric(s, errors='coerce')
        indexes = (s<1500)|(s>1000000)
        return ['background-color: yellow' if v else '' for v in indexes]
    
    styled = df.style.apply(higlight_outliers, axis=1)
    
    styled.to_excel("flagged_data.xlsx", index=False)
    

    enter image description here