Search code examples
pythonpandasmissing-datapandas-groupby

Pandas keep the most complete rows


Lets say I have a dataframe that has a lot of missing data:

df = pd.DataFrame({'id': ['a','a','b','b','b','c','d','e','e','e'],
                   'q1': [1,1,np.NaN,np.NaN,0,np.NaN,1,np.NaN,1,0],
                   'q2': ['low',np.NaN,np.NaN,'high','low','high','high',np.NaN,np.NaN,'low'],
                   'q3': [np.NaN,1,np.NaN,1,0,0,1,0,np.NaN,np.NaN]
                   })

Which looks like this:

  id   q1    q2   q3
0  a  1.0   low  NaN
1  a  1.0   NaN  1.0
2  b  NaN   NaN  NaN
3  b  NaN  high  1.0
4  b  0.0   low  0.0
5  c  NaN  high  0.0
6  d  1.0  high  1.0
7  e  NaN   NaN  0.0
8  e  1.0   NaN  NaN
9  e  0.0   low  NaN

I want to create a new dataframe that contains only 1 row from each id, but that row is the most complete (least instances of NaN), but if theyre equally complete, then the first occurrence in the current sort order

Ideal output is a new dataframe:

  id   q1    q2   q3
0  a  1.0   low  NaN
1  b  0.0   low  0.0
2  c  NaN  high  0.0
3  d  1.0  high  1.0
4  e  0.0   low  NaN

I can count the number of NA in each row using df.isnull().sum(axis=1) but I'm not sure how to use that to then select out the row with the smallest sum, especially if there are more than 2 entries for an id


Solution

  • You could use a surrogate column to sort based on counts and filter with a groupby.

    df = df.assign(count=df.isnull().sum(1))\
           .sort_values(['id', 'count'])\
           .groupby('id', as_index=0).head(1)\
           .drop('count', 1)
    
    print(df)
      id   q1    q2   q3
    0  a  1.0   low  NaN
    4  b  0.0   low  0.0
    5  c  NaN  high  0.0
    6  d  1.0  high  1.0
    9  e  0.0   low  NaN