Search code examples
pythonpandaslistdictionarycategories

Reverse lookup of a sub-string in a dict of lists


I am struggling to find a "pythonic" way to make this logic work:

I have a dataframe of payment transactions with a column of strings( "beneficiary"):

index beneficiary
12 REWE SAGT DANKE. ...
13 NaN
14 OBI BAU- U. HEIMWER//BERLIN/DE / OBI SAGT DANKE
15 NETFLIX INTERNATIONAL B.V.

I need to create another column in my dataframe which would be the category that every line belongs to.

index beneficiary category
12 REWE SAGT DANKE. ... Groceries
14 OBI BAU- U. HEIMWER//BERLIN/DE / OBI SAGT DANKE Groceries
15 NETFLIX INTERNATIONAL B.V. Entertainment

I am thinking to make a dictionary like this below and somehow reverse look-up the sub-string values from the categories dict with the column above:

categories = {"Groceries": ["EDEKA", "REWE", "OBI"],
            "Entertainment": ["NETFLIX"]}

The logic is: if sub-string "REWE" is in df['beneficiary'] then df['category'] equals the key of the dict element where the sub-string is.

I am open to other mapping logic.


Solution

  • You can use a regex, for this you need to rework your dictionary to have the matched strings as key and categories as values:

    categories = {"Groceries": ["EDEKA", "REWE", "OBI"],
                  "Entertainment": ["NETFLIX"]}
    
    cat_sub = {v:k for k,l in categories.items() for v in l}
    regex = r'(%s)' % '|'.join(fr'\b{c}\b' for c in cat_sub)
    # regex looks like this: (\bEDEKA\b|\bREWE\b|\bOBI\b|\bNETFLIX\b)
    
    df['category'] = df['beneficiary'].str.extract(regex, expand=False).map(cat_sub)
    

    NB. I used word boundaries (\b) to ensure matching full words, use regex = r'(%s)' % '|'.join(cat_sub) if you don't want this behaviour

    output:

       index                                       beneficiary       category
    0      12                             REWE SAGT DANKE. ...      Groceries
    1      13                                              NaN            NaN
    2      14  OBI BAU- U. HEIMWER//BERLIN/DE / OBI SAGT DANKE      Groceries
    3      15                       NETFLIX INTERNATIONAL B.V.  Entertainment
    

    NB. if needed to drop the NaNs, use dropna