given two columns of a pandas dataframe:
import pandas as pd
df = {'word': ['replay','replayed','playable','thinker','think','thoughtful', 'ex)mple'],
'root': ['play','play','play','think','think','think', 'ex)mple']}
df = pd.DataFrame(df, columns= ['word','root'])
I'd like to extract the substring of column word
that includes everything up to the end of the string in the corresponding column root
or NaN
if the string in root
is not included in word
. That is, the resulting dataframe would look as follows:
word root match
replay play replay
replayed play replay
playable play play
thinker think think
think think think
thoughtful think NaN
ex)mple ex)mple ex)mple
My dataframe has several thousand rows, so I'd like to avoid for-loops if necessary.
You can use a regex with str.extract
in a groupby
+apply
:
import re
df['match'] = (df.groupby('root')['word']
.apply(lambda g: g.str.extract(f'^(.*{re.escape(g.name)})'))
)
Or, if you expect few repeated "root" values:
import re
df['match'] = df.apply(lambda r: m.group()
if (m:=re.match(f'.*{re.escape(r["root"])}', r['word']))
else None, axis=1)
output:
word root match
0 replay play replay
1 replayed play replay
2 playable play play
3 thinker think think
4 think think think
5 thoughtful think NaN