python numpy categorization

How to more easily assign a category to a string in a new column with over 50 categories


I have a dataframe with a column of open-response strings that identify a US state (hopefully this will become a closed-ended question soon). I need to assign a state name to each response and am currently working with the following code.

import numpy as np
import pandas as pd

alabama_cat = ["alabama", "al"]
alaska_cat = ["alaska", "ak"]
newyork_cat = ["new york", "ny", "newyork"]

state_cat = [alabama_cat, alaska_cat, newyork_cat]

#Conditions for categories
conditions = [
    (survey['state'].str.lower().str.contains('|'.join(alabama_cat), na=False)),
    (survey['state'].str.lower().str.contains('|'.join(alaska_cat), na=False)),
    (survey['state'].str.lower().str.contains('|'.join(newyork_cat), na=False)),
]

#Names of categories
choices = ["Alabama", "Alaska", "New York"]

# categorize
survey['state_category'] = np.select(conditions, choices)

I'm wondering if there is an easier way to create the conditions variable. I'm hoping to find an automated way to run each list in state_cat through (survey['state'].str.lower().str.contains('|'.join(alabama_cat), na=False)). I need to run this process for every state, and possibly for territories and for responses that name other countries.

Thanks so much for any insight.


Solution

  • Instead of building a separate condition for each category, you can extract whichever alias appears in the response and then use map to turn that alias into the state name. Something like this:

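    # map each alias (lowercase) to its state name
    state_codes = {code: choice for cat, choice in zip(state_cat, choices)
                   for code in cat}

    # try longer aliases first so that "alaska" is not captured as "al"
    patt = '|'.join(sorted(state_codes, key=len, reverse=True))

    # lowercase the responses to match the aliases; \b stops "al" matching inside "california"
    survey['state_category'] = (survey['state'].str.lower()
                                .str.extract(rf'\b({patt})\b', expand=False)
                                .map(state_codes))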
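
For more than 50 categories it may be easier to keep everything in one dictionary rather than one list per state. The following is only a minimal sketch of the extract-then-map idea above, not a definitive implementation. The names STATE_ALIASES and categorize_states, the "Other" fallback label, and the sample data are all assumptions made up for illustration. It lowercases the responses, tries longer aliases first, and uses word boundaries so that a short code like "al" does not match inside a longer word.

```python
import re

import pandas as pd

# Hypothetical alias table: one entry per category, aliases in lowercase.
STATE_ALIASES = {
    "Alabama": ["alabama", "al"],
    "Alaska": ["alaska", "ak"],
    "New York": ["new york", "ny", "newyork"],
}


def categorize_states(responses: pd.Series, aliases: dict) -> pd.Series:
    """Map free-text responses to category names; unmatched rows become 'Other'."""
    # Invert the table: alias -> category name.
    alias_to_name = {alias: name for name, group in aliases.items() for alias in group}
    # Longest aliases first, plus \b word boundaries, so "al" cannot match inside "alaska".
    ordered = sorted(alias_to_name, key=len, reverse=True)
    pattern = r"\b(" + "|".join(re.escape(a) for a in ordered) + r")\b"
    # extract() returns the first alias found in each response, or NaN if none is found.
    matched = responses.str.lower().str.extract(pattern, expand=False)
    return matched.map(alias_to_name).fillna("Other")


# Made-up sample data, for illustration only.
survey = pd.DataFrame(
    {"state": ["Alaska", "AL", "I live in New York", "ny", "California", None]}
)
survey["state_category"] = categorize_states(survey["state"], STATE_ALIASES)
print(survey)
```

Adding a territory or another country is then a matter of adding one more key to the dictionary. Two-letter codes that are also ordinary words (for example "in" or "or") would probably still produce false matches in longer free-text answers, so those may need separate handling.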