python regex pandas case-insensitive

How to keep the word case from the text while pattern matching in Python


I have a DataFrame with two columns, Stg and Txt. The task is to check every word in the Stg column against each Txt row and write the matched words to a new column, keeping each word's case as it appears in Txt.

Example Code:

from pandas import DataFrame

new = {'Stg': ['way','Early','phone','allowed','type','brand name'],
        'Txt': ['An early term','two-way allowed','New Phone feature that allowed','amazing universe','new day','the brand name is stage']
        }

df = DataFrame(new,columns= ['Stg','Txt'])

my_list = df["Stg"].tolist()
import re

def words_in_string(word_list, a_string):
    word_set = set(word_list)
    pattern = r'\b({0})\b'.format('|'.join(word_list))
    for found_word in re.finditer(pattern, a_string):
        word = found_word.group(0)
        if word in word_set:
            word_set.discard(word)
            yield word
            if not word_set:
                return  # every word has been found; end the generator (PEP 479)

# Collect the matched words for each row of Txt
df['new'] = [list(words_in_string(my_list, text)) for text in df['Txt']]

The code above only finds words whose case matches the Stg column exactly. Is there a way to match regardless of case and return the word as it is written in Txt? I also want to match whole words only, not substrings: for the text 'two-way', the current code returns the word 'way'.
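Both symptoms can be reproduced with the standard re module alone. This is a minimal sketch; the variable names are only illustrative. It suggests that `\b` treats the hyphen as a word boundary, and that without `re.IGNORECASE` a differently cased word is not matched at all.

```python
import re

# \b matches between a word character and a non-word character.
# "-" is a non-word character, so "way" is found inside "two-way".
hyphen_hits = re.findall(r'\bway\b', 'two-way allowed')

# Case-sensitive matching: "Early" does not match "early".
exact_case_hits = re.findall(r'\bEarly\b', 'An early term')

# With re.IGNORECASE the match succeeds and keeps the case used in the text.
ignore_case_hits = re.findall(r'\bEarly\b', 'An early term', flags=re.IGNORECASE)

print(hyphen_hits, exact_case_hits, ignore_case_hits)
```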

Current Output:

    Stg            Txt                                   new
0   way           An early term                           []
1   Early         two-way allowed                         [way, allowed]
2   phone         New Phone feature that allowed          [allowed]
3   allowed       amazing universe                        []
4   type          new day                                 []
5   brand name    the brand name is stage                 [brand name]


Expected Output:

    Stg            Txt                                   new
0   way           An early term                           [early]
1   Early         two-way allowed                         [allowed]
2   phone         New Phone feature that allowed          [Phone, allowed]
3   allowed       amazing universe                        []
4   type          new day                                 []
5   brand name    the brand name is stage                 [brand name]

Solution

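  • Use Series.str.findall with flags=re.IGNORECASE and a negative lookbehind. The lookbehind rejects a match that is preceded by a letter or by one of the listed punctuation characters, such as the hyphen in 'two-way':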

    import pandas as pd
    import re
    
    new = {'Stg': ['way','Early','phone','allowed','type','brand name'],
            'Txt': ['An early term','two-way allowed','New Phone feature that allowed','amazing universe','new day','the brand name is stage']
            }
    
    df = pd.DataFrame(new,columns= ['Stg','Txt'])
    
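    # Raw f-string so that \w and \b reach the regex engine unchanged.
    # The lookbehind rejects a term preceded by a letter or by one of - ; : , / |
    pattern = "|".join(rf"\w*(?<![A-Za-z-;:,/|]){i}\b" for i in new["Stg"])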
    
    df["new"] = df["Txt"].str.findall(pattern, flags=re.IGNORECASE)
    
    print(df)
    
    # Output:
              Stg                             Txt               new
    0         way                   An early term           [early]
    1       Early                 two-way allowed         [allowed]
    2       phone  New Phone feature that allowed  [Phone, allowed]
    3     allowed                amazing universe                []
    4        type                         new day                []
    5  brand name         the brand name is stage      [brand name]
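
A possible variation on the pattern above, not part of the original answer, is sketched here with plain `re`. It escapes each term with `re.escape` and uses a lookaround on both sides instead of `\w*` and `\b`. The helper name `build_pattern` is hypothetical, and the definition of a whole word (no word character or hyphen on either side) is an assumption that may need adjusting for other data.

```python
import re

# Sample data copied from the question
terms = ['way', 'Early', 'phone', 'allowed', 'type', 'brand name']
texts = ['An early term', 'two-way allowed', 'New Phone feature that allowed',
         'amazing universe', 'new day', 'the brand name is stage']

def build_pattern(terms):
    """Hypothetical helper: build one whole-word alternation from the terms."""
    # Longest terms first, so a longer phrase is tried before a shorter
    # term that it starts with.
    ordered = sorted(terms, key=len, reverse=True)
    # re.escape keeps any regex metacharacters in a term literal.
    alternation = '|'.join(re.escape(t) for t in ordered)
    # Assumption: a whole word has no word character or hyphen on either side.
    return rf'(?<![\w-])(?:{alternation})(?![\w-])'

pattern = build_pattern(terms)
results = [re.findall(pattern, text, flags=re.IGNORECASE) for text in texts]
print(results)
```

The same pattern string can be passed to `df["Txt"].str.findall(pattern, flags=re.IGNORECASE)`. Because the group is non-capturing, `findall` returns the full matched text with the case used in Txt.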