Search code examples
pythonpandasdataframenumpyz-score

How to calculate the outliers in a Pandas dataframe while excluding NaN values


I have a pandas dataframe that should look like this.

   X     Y     Z           Is Outlier
0  9.5 -2.3   4.13         False
1  17.5 3.3   0.22         False
2  NaN  NaN  -5.67         NaN
3  547.16  11.17  -288.67  True
4  -0.05  3.55  6.78       False
...

Some values in this dataframe are outliers. I came across this method of calculating the outliers in every colum using the z score:

df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]

My goal is to create a column Is Outlier and put a True/False on each row that has/doesn't have at least one outlier and NaN for rows with at least one NaN value and, at the same time, keep a count of all "True" values.

This is my code so far.

dt = np.fromfile(path, dtype='float')
df = pd.DataFrame(dt.reshape(-1, 3), column = ['X', 'Y', 'Z'])

How can I go about doing this?


Solution

  • If you consider NaN rows to be noise, you can compute the zscore dropping them, this will automatically give you NaNs when you assign the result:

    from scipy.stats import zscore
    
    thresh = 1
    
    df['Is Outlier'] = zscore(df[['X', 'Y', 'Z']].dropna()).ge(thresh).any(1)
    

    NB. I used at threshold of 1 for the example here.

    Output:

            X      Y       Z Is Outlier
    0    9.50  -2.30    4.13      False
    1   17.50   3.30    0.22      False
    2     NaN    NaN   -5.67        NaN
    3  547.16  11.17 -288.67       True
    4   -0.05   3.55    6.78      False
    

    Alternatively, zscore has a nan_policy='omit' option, but this wouldn't directly give you NaN in the output. The zscore computation however will use all values, including those from NaN rows. (This makes no difference in the final result here).

    from scipy.stats import zscore
    
    thresh = 1
    
    df['Is Outlier'] = (zscore(df[['X', 'Y', 'Z']], nan_policy='omit')
                        .ge(thresh).any(1)
                        .mask(df[['X', 'Y', 'Z']].isna().any(1))
                        )
    

    Output:

            X      Y       Z Is Outlier
    0    9.50  -2.30    4.13      False
    1   17.50   3.30    0.22      False
    2     NaN    NaN   -5.67        NaN
    3  547.16  11.17 -288.67       True
    4   -0.05   3.55    6.78      False