I have a data frame with two columns Stg and Txt. The task is to check for all of the words in Stg Column with each Txt row and output the matched words into a new column while keeping the word case as in the Txt.
Example Code:
from pandas import DataFrame
new = {'Stg': ['way','Early','phone','allowed','type','brand name'],
'Txt': ['An early term','two-way allowed','New Phone feature that allowed','amazing universe','new day','the brand name is stage']
}
df = DataFrame(new,columns= ['Stg','Txt'])
my_list = df["Stg"].tolist()
import re
def words_in_string(word_list, a_string):
word_set = set(word_list)
pattern = r'\b({0})\b'.format('|'.join(word_list))
for found_word in re.finditer(pattern, a_string):
word = found_word.group(0)
if word in word_set:
word_set.discard(word)
yield word
if not word_set:
raise StopIteration
df['new'] = ''
for i,values in enumerate(df['Txt']):
a=[]
b = []
for word in words_in_string(my_list, values):
a=word
b.append(a)
df['new'][i] = b
exit
The above code returns the case from the Stg column. Is there a way to get the case from Txt. Also I want to check for the entire string and not the substring like in the case of the text 'two-way', the current code returns the word way.
Current Output:
Stg Txt new
0 way An early term []
1 Early two-way allowed [way, allowed]
2 phone New Phone feature that allowed [allowed]
3 allowed amazing universe []
4 type new day []
5 brand name the brand name is stage [brand name]
Expected Output:
Stg Txt new
0 way An early term [early]
1 Early two-way allowed [allowed]
2 phone New Phone feature that allowed [Phone, allowed]
3 allowed amazing universe []
4 type new day []
5 brand name the brand name is stage [brand name]
You should use Series.str.findall
with negative lookbehind:
import pandas as pd
import re
new = {'Stg': ['way','Early','phone','allowed','type','brand name'],
'Txt': ['An early term','two-way allowed','New Phone feature that allowed','amazing universe','new day','the brand name is stage']
}
df = pd.DataFrame(new,columns= ['Stg','Txt'])
pattern = "|".join(f"\w*(?<![A-Za-z-;:,/|]){i}\\b" for i in new["Stg"])
df["new"] = df["Txt"].str.findall(pattern, flags=re.IGNORECASE)
print (df)
#
Stg Txt new
0 way An early term [early]
1 Early two-way allowed [allowed]
2 phone New Phone feature that allowed [Phone, allowed]
3 allowed amazing universe []
4 type new day []
5 brand name the brand name is stage [brand name]