I know how to drop columns with many NA values from a data frame:
df = df.dropna(axis=1, thresh=100)
But is there a way to know their names?
I thought of creating something like this:
na_values = df.isnull().sum()/len(df)
This gives me the proportion of NA values in each column:
x_9 0.011700
x_10 0.001307
x_11 0.025830
x_12 0.011691
x_13 0.002739
x_14 0.238406
But I want to extract only the columns whose proportion is above a 0.2 threshold, like x_14.
Any ideas? I tried:
print(i > 0.4 for i in na_values)
but it doesn't work
You can index the same Series with a boolean condition to mask it:
In [36]:
na_values[na_values>0.2]
Out[36]:
na
val
x_14 0.238406
You can see the mask:
In [37]:
na_values > 0.2
Out[37]:
val
x_9 False
x_10 False
x_11 False
x_12 False
x_13 False
x_14 True
Name: na, dtype: bool
so indexing with this mask drops all the rows where the condition is False and keeps only the columns above the threshold.
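To get just the names, the index of the masked Series can be read off directly. The following is a minimal sketch on a made-up toy frame (the data and the variable names `high_na_cols`, `dropped_cols` and `thresh` are assumptions for illustration, not from the question). It also shows one way to list the columns that `dropna(axis=1, thresh=...)` would remove, since `thresh` counts non-NA values rather than a proportion:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: x_9 is 10% NA, x_10 has no NA, x_14 is 40% NA.
df = pd.DataFrame({
    "x_9": [1.0, np.nan, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "x_10": np.arange(10, dtype=float),
    "x_14": [np.nan, np.nan, np.nan, np.nan, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
})

# Proportion of NA per column (equivalent to df.isnull().sum() / len(df)).
na_values = df.isnull().mean()

# Mask the Series, then take its index to get the column names.
high_na_cols = na_values[na_values > 0.2].index.tolist()
print(high_na_cols)

# Columns that df.dropna(axis=1, thresh=thresh) would remove:
# those with fewer than `thresh` non-NA values.
thresh = 8
dropped_cols = df.columns[df.notna().sum() < thresh].tolist()
print(dropped_cols)
```

As for the attempt in the question, `print(i > 0.4 for i in na_values)` prints a generator object rather than the values, and iterating over a Series yields only the values, so the column labels would be lost anyway; the vectorised comparison keeps them in the index.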