I have a pandas dataframe that should look like this.
X Y Z Is Outlier
0 9.5 -2.3 4.13 False
1 17.5 3.3 0.22 False
2 NaN NaN -5.67 NaN
3 547.16 11.17 -288.67 True
4 -0.05 3.55 6.78 False
...
Some values in this dataframe are outliers. I came across this method of calculating the outliers in every colum using the z score:
df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]
My goal is to create a column Is Outlier
and put a True/False on each row that has/doesn't have at least one outlier and NaN for rows with at least one NaN value and, at the same time, keep a count of all "True" values.
This is my code so far.
dt = np.fromfile(path, dtype='float')
df = pd.DataFrame(dt.reshape(-1, 3), column = ['X', 'Y', 'Z'])
How can I go about doing this?
If you consider NaN rows to be noise, you can compute the zscore
dropping them, this will automatically give you NaNs when you assign the result:
from scipy.stats import zscore
thresh = 1
df['Is Outlier'] = zscore(df[['X', 'Y', 'Z']].dropna()).ge(thresh).any(1)
NB. I used at threshold of 1
for the example here.
Output:
X Y Z Is Outlier
0 9.50 -2.30 4.13 False
1 17.50 3.30 0.22 False
2 NaN NaN -5.67 NaN
3 547.16 11.17 -288.67 True
4 -0.05 3.55 6.78 False
Alternatively, zscore
has a nan_policy='omit'
option, but this wouldn't directly give you NaN in the output. The zscore computation however will use all values, including those from NaN rows. (This makes no difference in the final result here).
from scipy.stats import zscore
thresh = 1
df['Is Outlier'] = (zscore(df[['X', 'Y', 'Z']], nan_policy='omit')
.ge(thresh).any(1)
.mask(df[['X', 'Y', 'Z']].isna().any(1))
)
Output:
X Y Z Is Outlier
0 9.50 -2.30 4.13 False
1 17.50 3.30 0.22 False
2 NaN NaN -5.67 NaN
3 547.16 11.17 -288.67 True
4 -0.05 3.55 6.78 False