Search code examples
pythonpandasfillna

Best way to fill NaN values from multiple columns at the same time using specific columns of the same dataframe as reference


Example:

DF = pd.DataFrame({'A': [0, 0, np.NaN, 0     , np.NaN, 0     , 0     , 0     ],
                   'B': [1, 1, np.NaN, 1     , np.NaN, 1     , 1     , 1     ],
                   'C': [8, 8, np.NaN, 8     , np.NaN, np.NaN, 8     , 8     ],
                   'D': [2, 2, 2     , np.NaN, np.NaN, 2     , np.NaN, np.NaN],
                   'E': [3, 3, 3     , np.NaN, np.NaN, 3     , np.NaN, np.NaN]})

Expected result that I want is to get the columns A and B as filled as possible, that is:

   1) If DF['A'] line is NaN, it should get the correspondent DF['D'] line
   2) If DF['B'] line is NaN, it should get the correspondent DF['E'] line
   3) DF['C'] shall remain as it is

I'm trying:

DF[['A', 'B']] = DF[['A','B']].fillna(DF[['D','E']])

But it seems it would work only if there was two different dataframes with same columns name. I could split DF in DF1 and DF2, rename DF2['D'] to A and DF2['E'] to B and do:

DF1[['A', 'B']] = DF1[['A','B']].fillna(DF2[['A','B']])

But I don't think this would be the best way. Any ideas?

The actual dataset has 3 million lines, so it would be good to get the most effective solution :)

Thanks!! :)


Solution

  • Use np.where is a good alternative since it works with the underlying numpy arrays:

    DF[['A','B']] = np.where(DF[['A','B']].isna(), DF[['D','E']], DF[['A','B']])
    

    Output:

         A    B    C    D    E
    0  0.0  1.0  8.0  2.0  3.0
    1  0.0  1.0  8.0  2.0  3.0
    2  2.0  3.0  NaN  2.0  3.0
    3  0.0  1.0  8.0  NaN  NaN
    4  NaN  NaN  NaN  NaN  NaN
    5  0.0  1.0  NaN  2.0  3.0
    6  0.0  1.0  8.0  NaN  NaN
    7  0.0  1.0  8.0  NaN  NaN