Lets say I have a dataframe that has a lot of missing data:
df = pd.DataFrame({'id': ['a','a','b','b','b','c','d','e','e','e'],
'q1': [1,1,np.NaN,np.NaN,0,np.NaN,1,np.NaN,1,0],
'q2': ['low',np.NaN,np.NaN,'high','low','high','high',np.NaN,np.NaN,'low'],
'q3': [np.NaN,1,np.NaN,1,0,0,1,0,np.NaN,np.NaN]
})
Which looks like this:
id q1 q2 q3
0 a 1.0 low NaN
1 a 1.0 NaN 1.0
2 b NaN NaN NaN
3 b NaN high 1.0
4 b 0.0 low 0.0
5 c NaN high 0.0
6 d 1.0 high 1.0
7 e NaN NaN 0.0
8 e 1.0 NaN NaN
9 e 0.0 low NaN
I want to create a new dataframe that contains only 1 row from each id
, but that row is the most complete (least instances of NaN), but if theyre equally complete, then the first occurrence in the current sort order
Ideal output is a new dataframe:
id q1 q2 q3
0 a 1.0 low NaN
1 b 0.0 low 0.0
2 c NaN high 0.0
3 d 1.0 high 1.0
4 e 0.0 low NaN
I can count the number of NA in each row using df.isnull().sum(axis=1)
but I'm not sure how to use that to then select out the row with the smallest sum, especially if there are more than 2 entries for an id
You could use a surrogate column to sort based on counts and filter with a groupby
.
df = df.assign(count=df.isnull().sum(1))\
.sort_values(['id', 'count'])\
.groupby('id', as_index=0).head(1)\
.drop('count', 1)
print(df)
id q1 q2 q3
0 a 1.0 low NaN
4 b 0.0 low 0.0
5 c NaN high 0.0
6 d 1.0 high 1.0
9 e 0.0 low NaN