I am trying to compare the strings btw two DataFrame columns. category_df['column_text_to_find'] contains string to match in other dataframe df2['column_text_to_search']. The new column df2['matched text'] should return the df['column_text_to_find'] found in df2['column_text_to_search']. my expected result is
['column_text_to_search'] ['column_text_to_find'] ['matched text']
'SP * GRAPHICSDIRECT.ascdadv' 'GRAPHICSDIRECT' 'GRAPHICSDIRECT'
'99 CENTS ONLY #777#' '99 CENTS ONLY' '99 CENTS ONLY'
'PAYPAL *BESTBUY COM #3422#' 'BESTBUY' 'BESTBUY'
Unfortunately, my code returns an error.
csv import:
for f in all_files:
df = pd.read_csv(f, sep=',',header=[3])
df2 = df
remove blank spaces:
df2['column_text_to_search']=df2['column_text_to_search'].str.strip()
search and match text:
ch = category_df['column_text_to_find']
pat = r'\b({0})\b'.format('|'.join(ch))
df2['matched text'] = df2['column_text_to_search'].str.findall(pat, flags =
re.IGNORECASE).map("_".join)
df2.head()
Error:
TypeError: sequence item 0: expected str instance, tuple found
You can use
pattern = r'(?i)\b({0})\b'.format("|".join(df["column_text_to_find"].to_list()))
df["column_text_to_search"].str.findall(pattern).str.join('_')
Or, if your "words" to find can contain special chars anywhere in the string:
pattern = r'(?i)(?!\B\w)({0})(?<!\w\B)'.format("|".join( sorted(map(re.escape, df["column_text_to_find"].to_list()), key=len, reverse=True) ))
df["column_text_to_search"].str.findall(pattern).str.join('_')
Note the use of
(?i)
- it enables case insensitive search\b...\b
- word boundaries enable whole word search for natural language words (if the "wors" can contain special chars in arbitrary positions, you cannot rely on word boundaries)(?!\B\w)
/ (?<!\w\B)
- dynamic adaptive word boundaries that only require a word boundary if the neighbouring char in the word to find is a word char"|".join(df["column_text_to_find"].to_list())
- forms an alternation based pattern of values inside the column_text_to_find column.sorted(map(re.escape, df["column_text_to_find"].to_list()), key=len, reverse=True)
- sorts the words to find by length in descending order and escapes them for use in regex.findall(pattern)
- finds all occurrences of the pattern and.str.join('_')
- joins them with _
.