python pandas extract

Extract pattern from a column based on another column's value


Given two columns of a pandas DataFrame:

import pandas as pd
data = {'word': ['replay', 'replayed', 'playable', 'thinker', 'think', 'thoughtful', 'ex)mple'],
        'root': ['play', 'play', 'play', 'think', 'think', 'think', 'ex)mple']}
df = pd.DataFrame(data, columns=['word', 'root'])

For each row, I'd like to extract the substring of column word that runs from the start of the string through the end of the corresponding string in column root, or NaN if the string in root does not occur in word. That is, the resulting dataframe would look as follows:

word       root    match
replay     play    replay
replayed   play    replay
playable   play    play
thinker    think   think
think      think   think
thoughtful think   NaN
ex)mple    ex)mple ex)mple

My dataframe has several thousand rows, so I'd like to avoid for-loops if possible.


Solution

  • You can use a regex with str.extract in a groupby+apply:

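    import re
    # One regex per distinct root (g.name); re.escape protects roots such as 'ex)mple'.
    # group_keys=False keeps the original row index so the result aligns with df;
    # expand=False makes str.extract return a Series instead of a one-column DataFrame.
    df['match'] = (df.groupby('root', group_keys=False)['word']
                     .apply(lambda g: g.str.extract(f'^(.*{re.escape(g.name)})', expand=False))
                   )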
    

    Or, if you expect few repeated "root" values:

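    import re
    # Row-wise alternative; the walrus operator (:=) requires Python 3.8+.
    # Rows without a match get None, which pandas treats as missing.
    df['match'] = df.apply(lambda r: m.group()
                           if (m:=re.match(f'.*{re.escape(r["root"])}', r['word']))
                           else None, axis=1)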
    

    Output:

             word     root    match
    0      replay     play   replay
    1    replayed     play   replay
    2    playable     play     play
    3     thinker    think    think
    4       think    think    think
    5  thoughtful    think      NaN
    6     ex)mple  ex)mple  ex)mple
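
For comparison, here is a minimal sketch of a regex-free variant. It is not part of the answer above. It assumes the greedy behaviour of the regex is what is wanted, i.e. the extracted prefix ends at the last occurrence of root. It uses plain `str.rfind` in a list comprehension, so roots such as 'ex)mple' need no escaping. The helper name `prefix_through_root` is hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'word': ['replay', 'replayed', 'playable', 'thinker', 'think', 'thoughtful', 'ex)mple'],
    'root': ['play', 'play', 'play', 'think', 'think', 'think', 'ex)mple'],
})

def prefix_through_root(word, root):
    """Hypothetical helper: word up to the end of the last occurrence of root, else NaN."""
    idx = word.rfind(root)  # -1 when root does not occur in word
    return word[:idx + len(root)] if idx != -1 else np.nan

# Pair each word with its own root; no per-row regex compilation is involved.
df['match'] = [prefix_through_root(w, r) for w, r in zip(df['word'], df['root'])]
print(df)
```

Replacing `rfind` with `find` would stop at the first occurrence of root instead, which corresponds to a non-greedy `.*?` in the regex versions.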