I have my dataframe:
I would like to create a subset of this dataframe containing the column with the fewest NaN values (i.e. the most "valid" values).
In this case, I would select only the "A3" column, as it has only one NaN while the others have three.
If two or more columns have the same number of NaN values, just select one of them (for example the first; it does not matter which).
The code for creating the dataframe:
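import numpy as np
import pandas as pd

# np.nan is used instead of np.NaN, which was removed in NumPy 2.0
df = pd.DataFrame({"A1": [np.nan, 1, 0, 0, np.nan, 0, 1, np.nan, 0, 0, 0, 1],
                   "A2": [0, 1, np.nan, 0, 1, np.nan, 1, 0, np.nan, 0, 0, 1],
                   "A3": [0, 1, np.nan, 0, 1, 0, 1, 0, 0, 0, 0, 2]})
df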
You can count the null values in each column using pd.isnull and .sum(), then pick the column with the lowest count using .idxmin() and select just that column from your dataframe:
df[pd.isnull(df).sum().idxmin()]
Output:
0 0.0
1 1.0
2 NaN
3 0.0
4 1.0
5 0.0
6 1.0
7 0.0
8 0.0
9 0.0
10 0.0
11 2.0
Name: A3, dtype: float64
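
If it helps, the one-liner can be unpacked into separate steps. This is only a sketch: the names nan_counts, best_col, subset and tied are made up for illustration, and df.isna() is used as the method equivalent of pd.isnull(df). Indexing with a list (df[[best_col]]) keeps the result as a one-column DataFrame rather than a Series, which may be closer to the "subset" asked for. On ties, idxmin() returns the first label that has the minimum, so the first of the tied columns is picked.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A1": [np.nan, 1, 0, 0, np.nan, 0, 1, np.nan, 0, 0, 0, 1],
    "A2": [0, 1, np.nan, 0, 1, np.nan, 1, 0, np.nan, 0, 0, 1],
    "A3": [0, 1, np.nan, 0, 1, 0, 1, 0, 0, 0, 0, 2],
})

# Step 1: count the NaN values in each column (a Series indexed by column name)
nan_counts = df.isna().sum()

# Step 2: label of the column with the fewest NaN values
best_col = nan_counts.idxmin()

# Step 3: select that column; the list keeps the result a DataFrame
subset = df[[best_col]]

# Tie case (hypothetical data): both columns have one NaN, the first is chosen
tied = pd.DataFrame({"X": [np.nan, 1.0], "Y": [np.nan, 2.0]})
tied_choice = tied.isna().sum().idxmin()
```

Here nan_counts holds 3, 3 and 1 for A1, A2 and A3, so best_col is "A3", and tied_choice is "X".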