python pandas nan drop-duplicates

Dropna when another row has the missing data OR drop_duplicates with NaN matching all data


I have data like the following:

Index  ID    data1  data2 ...
0      123   0      NaN   ...
1      123   0      1     ...
2      456   NaN    0     ...
3      456   NaN    0     ...
...

I need to drop any row that contains no more information than another row that is otherwise identical (the same values wherever both are non-missing).

In the example above, row 0 and exactly one of rows 2 and 3 should be removed.

My best attempt so far is rather slow and also does not work:

df.groupby(by='ID').fillna(method='ffill',inplace=True).fillna(method='bfill',inplace=True)
df.drop_duplicates(inplace=True)

How can I best accomplish this goal?


Solution

  • Your approach seems fine; the in-place fill just does not work here because it operates on a copy of the data. Assign the result back instead:

    # Fill forward, then backward, within each ID group only
    cols = df.columns.drop('ID')
    df[cols] = df.groupby('ID')[cols].transform(lambda s: s.ffill().bfill())
    
    df.drop_duplicates()
    
       ID   data1  data2
    0  123    0.0    1.0
    2  456    NaN    0.0
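
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the fill-then-deduplicate idea on the sample data from the question. The helper name drop_less_informative and the variable value_cols are illustrative and not part of the original answer. Both the forward and the backward fill are kept inside each ID group, which is an assumption about what the question intends.

```python
import numpy as np
import pandas as pd


def drop_less_informative(df: pd.DataFrame, key: str = "ID") -> pd.DataFrame:
    """Fill gaps within each `key` group, then drop rows that became duplicates.

    Hypothetical helper for illustration only.
    """
    value_cols = [c for c in df.columns if c != key]
    filled = df.copy()
    # transform keeps the original index and fills group by group,
    # so values never leak from one ID into another.
    filled[value_cols] = df.groupby(key)[value_cols].transform(
        lambda s: s.ffill().bfill()
    )
    # drop_duplicates treats NaN as equal to NaN, so rows 2 and 3 collapse.
    return filled.drop_duplicates()


# Sample data from the question (the "..." columns are omitted).
df = pd.DataFrame({
    "ID": [123, 123, 456, 456],
    "data1": [0, 0, np.nan, np.nan],
    "data2": [np.nan, 1, 0, 0],
})

result = drop_less_informative(df)
print(result)
```

On this data the rows with index 0 and 2 are kept, matching the output above. Note that the approach assumes rows sharing an ID never disagree on a non-missing value. If they can, filling would combine values from different rows rather than simply removing the less complete one.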