Search code examples
pythonpandasdataframestring-matchingfindall

Python - Findall matching string(s) between two DataFrame columns - sequence item 0: expected str instance, tuple found


I am trying to compare the strings btw two DataFrame columns. category_df['column_text_to_find'] contains string to match in other dataframe df2['column_text_to_search']. The new column df2['matched text'] should return the df['column_text_to_find'] found in df2['column_text_to_search']. my expected result is

      ['column_text_to_search']   ['column_text_to_find']     ['matched text']
   'SP * GRAPHICSDIRECT.ascdadv'     'GRAPHICSDIRECT'          'GRAPHICSDIRECT'
   '99 CENTS ONLY #777#'             '99 CENTS ONLY'           '99 CENTS ONLY'
   'PAYPAL *BESTBUY COM  #3422#'     'BESTBUY'                 'BESTBUY'

  

Unfortunately, my code returns an error.

csv import:

for f in all_files:
    df = pd.read_csv(f, sep=',',header=[3])
df2 = df

remove blank spaces:

df2['column_text_to_search']=df2['column_text_to_search'].str.strip()

search and match text:

 ch = category_df['column_text_to_find']
 pat = r'\b({0})\b'.format('|'.join(ch))
 df2['matched text'] = df2['column_text_to_search'].str.findall(pat, flags = 
 re.IGNORECASE).map("_".join)
 df2.head()

Error:

TypeError: sequence item 0: expected str instance, tuple found

Solution

  • You can use

    pattern = r'(?i)\b({0})\b'.format("|".join(df["column_text_to_find"].to_list()))
    df["column_text_to_search"].str.findall(pattern).str.join('_')
    

    Or, if your "words" to find can contain special chars anywhere in the string:

    pattern = r'(?i)(?!\B\w)({0})(?<!\w\B)'.format("|".join( sorted(map(re.escape, df["column_text_to_find"].to_list()), key=len, reverse=True) ))
    df["column_text_to_search"].str.findall(pattern).str.join('_')
    

    Note the use of

    • (?i) - it enables case insensitive search
    • \b...\b - word boundaries enable whole word search for natural language words (if the "wors" can contain special chars in arbitrary positions, you cannot rely on word boundaries)
    • (?!\B\w) / (?<!\w\B) - dynamic adaptive word boundaries that only require a word boundary if the neighbouring char in the word to find is a word char
    • "|".join(df["column_text_to_find"].to_list()) - forms an alternation based pattern of values inside the column_text_to_find column.
    • sorted(map(re.escape, df["column_text_to_find"].to_list()), key=len, reverse=True) - sorts the words to find by length in descending order and escapes them for use in regex
    • .findall(pattern) - finds all occurrences of the pattern and
    • .str.join('_') - joins them with _.