I am looking for an efficient way to perform the following join on two pandas.DataFrames:
The first one contains in column A
strings like:
A ....
Spam|One
Spam|Two
Ham
Eggs
The second one is a reference table with string keys as the index and a text description:
index description
Spam "..."
Ham "..."
Eggs "...."
BakedBeans "...."
Each key in the index can be a prefix to a string in A
. Now I'd like to join the description column to column A where the value in A matches the prefix. I have a feeling that there is an efficient pandas one-liner for it...
We can assume that all values in A
have the format "prefix|suffix" or "prefix". Perhaps that can speed up things.
Use Series.str.split
with select first values of lists and Series.map
by Series
:
print (df1)
A
0 Spam|One
1 Spam|Two
2 Ham
3 Eggs
print (df2)
description
Spam aaa
Ham sd
Eggs rty
print (df2.index)
Index(['Spam', 'Ham', 'Eggs'], dtype='object')
df1['description'] = df1['A'].str.split('|').str[0].map(df2['description'])
print (df1)
A description
0 Spam|One aaa
1 Spam|Two aaa
2 Ham sd
3 Eggs rty
More general solution should be Series.str.extract
, with ^
for extract starting strings and join
by |
for regex OR
by df2.index
:
pat = '(' + '|'.join('^' + df2.index) + ')'
df1['new'] = df1['A'].str.extract(pat,expand=False).map(df2['description'])