Search code examples
pythonpandasdataframedictionary

Fill Column Values using a dictionary and pattern matching


I am working on categorizing credit card transaction. Right now I am using a dictionary combined with np.select() as follows:

    def cat_mapper(frame, targ_col, cat_col):
    
        category_retailers = {'Online Shopping':['amazon','amzn mktp', 'target.com'],
                              'Wholesale Stores': ['costco', 'target'],
                             }
        cond = [frame[targ_col].str.contains('|'.join(category_retailers['Online Shopping']),regex=True,case=False),
                frame[targ_col].str.contains('|'.join(category_retailers['Wholesale Stores']),regex=True,case=False),
               ]

        choice = ['Online Shopping',
                  'Wholesale Stores'
                 ]
       
        default_cond = frame[cat_col]

        frame[cat_col] = np.select(cond, choice, default_cond)
        return frame

Where the frame parameter is the dataframe, targ_col parameter is the Description column with the transaction description or name and the cat_col parameter is the category column that will contain the transaction category.

The basic premise is to check the transaction description column for partial matches in the dictionary values, if there is a partial match in the description assign the corresponding dictionary key to the category column.

There are no issues with the functionality of the above block, but there is some redundancy in have to define dictionary values then match the dictionary values to corresponding conditions and choices for np.select.

Is there a way to pattern match the transaction description against a list of values in the dictionary and assign the dictionary key to the category column without using np.select as an intermediary?

I assume I can use a nested loop but even that seems to verbose. Is there a more eloquent way to achieve the same outcome.

sample dataframe:

data_dict = {'Description': ['amazon 345689','amzn mktp online 7765','amazon 4444','costco location','Wholefoods'],
           'Category':['NaN','NaN','NaN','NaN','Groceries']
          }
df = pd.DataFrame(data=data_dict)

Sample Output:

data_dict = {'Description': ['amazon 345689','amzn mkpt online 7765','amazon 4444','costco location','Wholefoods'],
           'Category':['Online Shopping','Online Shopping','Online Shopping','Online Shopping','Groceries']
          }
df = pd.DataFrame(data=data_dict)

The above sample output should be reproducible with the below code block and my current np.select() framework.

import pandas as pd
import numpy as np


def cat_mapper(frame, targ_col, cat_col):

    category_retailers = {'Online Shopping':['amazon','amzn mktp', 'target.com'],
                            'Wholesale Stores': ['costco', 'target'],
                            }
    cond = [frame[targ_col].str.contains('|'.join(category_retailers['Online Shopping']),regex=True,case=False),
            frame[targ_col].str.contains('|'.join(category_retailers['Wholesale Stores']),regex=True,case=False),
            ]

    choice = ['Online Shopping',
                'Wholesale Stores',
                ]

    default_cond = frame[cat_col]

    frame[cat_col] = np.select(cond, choice, default_cond)
    return frame

data_dict ={'Description': ['amazon 345689','amzn mktp online 7765','amazon 4444','costco location','Wholefoods'],
           'Category':['NaN','NaN','NaN','NaN','Groceries']
          }


df = pd.DataFrame(data=data_dict)

cat_mapper(df,'Description','Category')

Thanks in advance and let me know if you need me to provide any additional details


Solution

  • You could refactor your code by using a list comprehension to compute the conditions and taking advantage of the dictionary keys/values:

    def cat_mapper(frame, targ_col, cat_col):
        category_retailers = {'Online Shopping':['amazon','amzn mktp', 'target.com'],
                              'Wholesale Stores': ['costco', 'target'],
                             }
        cond = [frame[targ_col].str.contains('|'.join(val),regex=True,case=False)
                for val in category_retailers.values()]
    
        frame[cat_col] = np.select(cond, list(category_retailers), frame[cat_col])
        return frame
    
    cat_mapper(df, 'Description', 'Category')
    

    Output:

                 Description          Category
    0          amazon 345689   Online Shopping
    1  amzn mkpt online 7765               NaN  # not matching because of typo
    2            amazon 4444   Online Shopping
    3        costco location  Wholesale Stores
    4             Wholefoods         Groceries
    

    Used input:

    data_dict = {'Description': ['amazon 345689','amzn mkpt online 7765','amazon 4444','costco location','Wholefoods'],
                 'Category':['Nan','Nan','Nan','Nan','Groceries']
              }
    df = pd.DataFrame(data=data_dict).replace('Nan', float('nan'))