Search code examples
pythonpandasnathreshold

Python: Find the name of the columns where NA values are above a threshold


I know how to drop columns with many NA from a data frame,

df = df.dropna(axis=1, thresh=100)

But is there a way to know their names?

I though to create something like this:

na_values =  df.isnull().sum()/len(df)

Which gives me a list with the portion of NA within each columns

x_9       0.011700
x_10      0.001307
x_11      0.025830
x_12      0.011691
x_13      0.002739
x_14      0.238406

But I want to extract only the ones that have above a 0.2 threshold, like x_14.

Any ideas? I tried:

print(i > 0.4 for i in na_values)

but it doesn't work


Solution

  • You can use the same Series with a boolean condition to mask the Series:

    In [36]:
    na_values[na_values>0.2]
    
    Out[36]:
                na
    val           
    x_14  0.238406
    

    You can see the mask:

    In [37]:
    na_values > 0.2
    
    Out[37]:
    val
    x_9     False
    x_10    False
    x_11    False
    x_12    False
    x_13    False
    x_14     True
    Name: na, dtype: bool
    

    so this will mask out all the False rows