Search code examples
pythonpandasdataframemergelookup

Pandas DataFrame: Based on a string in a row of col 'C', look that value up in col 'A', if found return the value of col 'B' for that matched row


I'm looking for a cleaner solution to this problem than the one I have come up with:

Based on the string value of a row in column 'C', look that value up in the string values of column 'A', and if found return the string value of column 'B' for that row.

I have a df that looks like this (notice loc[1]['C']):

        A       B       C
0   zeroA   zeroB   zeroC
1   oneA    oneB    zeroA
2   twoA    twoB    twoC

What I need: I need to add another column called 'BB' that is the result of looking up each string value from column 'C' in column 'A' and if a match is found return the string value of column 'B' for that row to my new column 'BB' (as seen at index[1] below). If not, return that row's column 'B' to column 'BB' (as seen at index[0,2] below).

        A       B       C      BB
0   zeroA   zeroB   zeroC   zeroB
1   oneA    oneB    zeroA   zeroB
2   twoA    twoB    twoC    twoB

I have come up with a way of doing this by:

  1. creating a temp_df,
  2. merge the tmp_df back into df,
  3. and then backfill the NaN rows as demonstrated below.

I'm sure there is a much cleaner way of doing this. Any suggestion would be appreciated.

What I came up with:

>>> data = {'A': ['zeroA', 'oneA', 'twoA'], 'B': ['zeroB', 'oneB', 'twoB'], 'C': ['zeroC', 'zeroA', 'twoC']}

>>> df = pd.DataFrame(data)

>>> df
       A      B      C
0  zeroA  zeroB  zeroC
1   oneA   oneB  zeroA
2   twoA   twoB   twoC


# Step 1. create a tmp_df containing columns ['A','B'] with 'A' renamed to 'C' and 'B' renamed to 'BB'
>>> tmp_df = df[['A', 'B']].rename({'A':'C', 'B':'BB'}, axis=1)

>>> tmp_df
       C     BB
0  zeroA  zeroB
1   oneA   oneB
2   twoA   twoB


# Step 2. Left merge tmp_df into df keyed on 'C' columns. 
>>> df = pd.merge(df, tmp_df, how='left', on='C')

>>> df
       A      B      C     BB
0  zeroA  zeroB  zeroC    NaN
1   oneA   oneB  zeroA  zeroB
2   twoA   twoB   twoC    NaN


# Step 3. Fill in the null rows of column 'BB' with it's row's column 'B' value.
>>> df.loc[df['BB'].isnull(), ['BB']] = df['B']

>>> df
       A      B      C     BB
0  zeroA  zeroB  zeroC  zeroB
1   oneA   oneB  zeroA  zeroB
2   twoA   twoB   twoC   twoB

Any suggestion on improving/eliminating the temp_df would be appreciated. Thank You


Solution

  • I prefer to apply a function in these scenarios:

    def find_value(row: pd.Series) -> pd.Series:
        val = df.query("@df['A'] == @row['C']")
    
        return row["B"] if val.empty else val["B"].squeeze()
    
    
    df["BB"] = df.apply(lambda x: find_value(x), axis=1)