Search code examples
pythonpandasjoinstring-matching

joining pandas dataframes by string prefix


I am looking for an efficient way to perform the following join on two pandas.DataFrames:

The first one contains in column A strings like:

A  ....

Spam|One
Spam|Two
Ham
Eggs

The second one is a reference table with string keys as the index and a text description:

index      description
Spam       "..."
Ham        "..." 
Eggs       "...."
BakedBeans "...."

Each key in the index can be a prefix to a string in A. Now I'd like to join the description column to column A where the value in A matches the prefix. I have a feeling that there is an efficient pandas one-liner for it...

We can assume that all values in A have the format "prefix|suffix" or "prefix". Perhaps that can speed up things.


Solution

  • Use Series.str.split with select first values of lists and Series.map by Series:

    print (df1)
              A
    0  Spam|One
    1  Spam|Two
    2       Ham
    3      Eggs
    
    print (df2)  
         description
    Spam         aaa
    Ham           sd
    Eggs         rty
    
    print (df2.index)
    Index(['Spam', 'Ham', 'Eggs'], dtype='object')
    
    df1['description'] = df1['A'].str.split('|').str[0].map(df2['description'])
    print (df1)
              A description
    0  Spam|One         aaa
    1  Spam|Two         aaa
    2       Ham          sd
    3      Eggs         rty
    

    More general solution should be Series.str.extract, with ^ for extract starting strings and join by | for regex OR by df2.index:

    pat = '(' + '|'.join('^' + df2.index) + ')'
    df1['new'] = df1['A'].str.extract(pat,expand=False).map(df2['description'])