Search code examples
pythonpandasduplicatespandas-groupbyfillna

How to fill in the nan values based on prior found values in the pandas dataframe?


I have the following example dataframe (normal dataframe has more than 1000 rows)

df = pd.DataFrame({'P1':['jaap','tim','piet','tim','tim'],
                   'P2':['piet','jaap','jaap','piet','jaap'],
                   'Count1':[2, 3, np.nan, np.nan, np.nan], 'Count2':[3, 1, np.nan, np.nan, np.nan]})
print(df)

     P1    P2  Count1  Count2
0  jaap  piet     2.0     3.0
1   tim  jaap     3.0     1.0
2  piet  jaap     NaN     NaN
3   tim  piet     NaN     NaN
4   tim  jaap     NaN     NaN

Now I want to find a neat way that fills in the NaN values, according to the following rule:

The names found in P1 and P2 have to be the same. 

Thus the nan values found in row number 2 have to be the same as the values in row 0, only the values have to be swapped since the names are also swapped. The nan values in row 3 should be kept as nan, since the combination of tim and piet is not found in any prior row. The nan values in row 4 have to be the same as the values in row 1. So the desired result is:

     P1    P2  Count1  Count2
0  jaap  piet     2.0     3.0
1   tim  jaap     3.0     1.0
2  piet  jaap     3.0     2.0
3   tim  piet     NaN     NaN
4   tim  jaap     3.0     1.0

This question is very similar: Pandas fillna using groupby

Only if the proposed solution in that article is applied to the question in this post, the result is slightly off:

df.groupby(['P1','P2'])[['Count1','Count2']].apply(lambda x: x.fillna(method = 'ffill'))
print(df)

             Count1  Count2
  P1   P2                  
0 jaap piet     2.0     3.0
1 tim  jaap     3.0     1.0
2 piet jaap     NaN     NaN
3 tim  piet     NaN     NaN
4 tim  jaap     3.0     1.0

As you can see the names jaap and piet in row 0 changed column in row 2, thus it doesn't work.


Solution

  • Idea is create DataFrame with changed order first in concat with rename, remove rows with missing values and possible duplicates and change original values by DataFrame.update:

    d = {'P2':'P1','P1':'P2','Count1':'Count2','Count2':'Count1'}
    df1 = (pd.concat([df, df.rename(columns=d)])
             .dropna(subset=['Count1','Count2'])
             .drop_duplicates(['P1','P2']))
    
    df = df.set_index(['P1','P2'])
    df1 = df1.set_index(['P1','P2'])
    
    df.update(df1)
    
    df = df.reset_index()
    print (df)
    
         P1    P2  Count1  Count2
    0  jaap  piet     2.0     3.0
    1   tim  jaap     3.0     1.0
    2  piet  jaap     3.0     2.0
    3   tim  piet     NaN     NaN
    4   tim  jaap     3.0     1.0