How to use pandas.Series.str.contains with tqdm progress map?


I'm trying to add a new column to one dataframe (dfB) that flags whether its text contains any of the values from another dataframe (dfA):

s = dfA['value'].tolist() 
dfB['value'] = dfB['text_bod'].str.contains('|'.join(s))

Can progress_map be used with this setup?

dfB['value'] = dfB['text_bod'].progress_map(func)

Or is there some other way tqdm can be implemented?


Alternative method using FlashText:

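from flashtext import KeywordProcessor
from tqdm import tqdm

tqdm.pandas()  # registers progress_map / progress_apply on pandas objects

s = dfA['value'].tolist()

processor = KeywordProcessor()
processor.add_keywords_from_list(s)

# extract_keywords returns a list of the keywords found in each text
dfB['value'] = dfB['text_bod'].progress_map(processor.extract_keywords)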

Solution

  • I'm not aware of a way to hook tqdm into str.contains directly, but you can use progress_map with a callback that performs the same test with re.search:

    import re
    from tqdm import tqdm

    tqdm.pandas()  # registers progress_map on pandas objects

    dfB['value'] = dfB['text_bod'].progress_map(
        lambda x: bool(re.search('|'.join(s), x))
    )
    

    As a named function with a precompiled pattern, you can use:

    import numpy as np

    def extract(x, p):
        m = p.search(x)
        if m:
            return m.group(0)  # the matched text
        return np.nan  # no keyword found

    p = re.compile('|'.join(s))
    dfB['value'] = dfB['text_bod'].progress_map(lambda x: extract(x, p))
    

    This gives you more flexibility than a lambda, since the function can return something other than a boolean and the pattern is compiled only once.
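
If you would rather keep the vectorized str.contains call, one possible workaround is to split the column into chunks and wrap the loop over the chunks in tqdm. The following is only a rough sketch under a few assumptions: the toy dfA/dfB data is made up, contains_with_progress and its n_chunks parameter are hypothetical names that are not part of pandas or tqdm, and the keywords are passed through re.escape (unlike the snippets above) so that characters such as '+' are matched literally.

```python
import re

import numpy as np
import pandas as pd
from tqdm import tqdm

# Toy stand-ins for dfA and dfB (made-up data, for illustration only)
dfA = pd.DataFrame({"value": ["apple", "kiwi", "c++"]})
dfB = pd.DataFrame({
    "text_bod": [
        "I like apple pie",
        "nothing here",
        "c++ is fast",
        "kiwi!",
        "plain text",
    ]
})

# Escape each keyword so regex metacharacters are treated literally
pattern = "|".join(re.escape(k) for k in dfA["value"])


def contains_with_progress(series, pattern, n_chunks=4):
    """Hypothetical helper: run str.contains chunk by chunk under tqdm."""
    n_chunks = max(1, min(n_chunks, len(series)))
    # Integer positions that cut the series into n_chunks slices
    bounds = np.linspace(0, len(series), n_chunks + 1, dtype=int)
    parts = [
        series.iloc[start:stop].str.contains(pattern)
        for start, stop in tqdm(zip(bounds[:-1], bounds[1:]), total=n_chunks)
    ]
    # The chunks keep their original index, so concat restores the order
    return pd.concat(parts)


dfB["value"] = contains_with_progress(dfB["text_bod"], pattern)
```

The bar advances once per chunk rather than once per row, so a larger n_chunks gives finer-grained progress at the cost of a little more overhead per call.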