Search code examples
pandasnumpydataframemissing-datafillna

Fill NaN's within 1 column of a df via lookup to another df via pandas


I seen various versions of this question but none of them seem to fit with what I am attempting to do: here's my data:

Here's the df with the NaNs:

df = pd.DataFrame({"A": ["10023", "10040", np.nan, "12345", np.nan, np.nan, "10033", np.nan, np.nan],
               "B": [",", "17,-6", "19,-2", "17,-5", "37,-5", ",", "9,-10", "19,-2", "2,-5"],
               "C": ["small", "large", "large", "small", "small", "large", "small", "small", "large"]})

       A      B      C
0  10023      ,  small
1  10040  17,-6  large
2    NaN  19,-2  large
3  12345  17,-5  small
4    NaN  37,-5  small
5    NaN      ,  large
6  10033  9,-10  small
7    NaN  19,-2  small
8    NaN   2,-5  large

Next I have the lookup df called df2:

df2 = pd.DataFrame({"B": ['17,-5', '19,-2', '37,-5', '9,-10'],
                "A": ["10040", "54321", "12345", "10033"]})

       B      A
0  17,-5  10040
1  19,-2  54321
2  37,-5  12345
3  9,-10  10033

I would like to fill in the NaNs of column A on df by looking up column df2.B and returning df2.A such that the resulting dfr looks like this:

       A      B      C
0  10023      ,  small
1  10040  17,-6  large
2  54321  19,-2  large
3  10040  17,-5  small
4  12345  37,-5  small
5    NaN      ,  large
6  10033  9,-10  small
7  54321  19,-2  small
8    NaN   2,-5  large

Important caveats:

  1. The dfs do not have matching indexes
  2. The contents of df.A and df2.A are non-unique()
  3. The rows of df2 do make up unique pairs.
  4. Assume that there are more columns, not shown, with NaNs.

Using pandas, the rows of interest on df would be found (I think) via: df.loc[df['A'].isnull(),]. This answer seemed promising but I'm unclear where df1 in that example comes from. My actual data set is much, much larger than this and I'll have to be doing replacing several columns in this way.


Solution

  • Just using np.where

    df.A=np.where(df.A.isnull(),df.B.map(df2.set_index('B').A),df.A)
    df
    Out[149]: 
           A      B      C
    0  10023      ,  small
    1  10040  17,-6  large
    2  54321  19,-2  large
    3  12345  17,-5  small
    4  12345  37,-5  small
    5    NaN      ,  large
    6  10033  9,-10  small
    7  54321  19,-2  small
    8    NaN   2,-5  large