I have a dataframe with a column of open response strings that identify a state in the US (hopefully, this will become a closed ended question soon). I need to assign each response with a state name and am currently working with the following code.
alabama_cat = ["alabama", "al"]
alaska_cat = ["alaska", "ak"]
newyork_cat = ["new york", "ny", "newyork"]
state_cat = [alabama_cat, alaska_cat, newyork_cat]
#Conditions for categories
conditions = [
(survey['state'].str.lower().str.contains('|'.join(alabama_cat), na=False)),
(survey['state'].str.lower().str.contains('|'.join(alaska_cat), na=False)),
(survey['state'].str.lower().str.contains('|'.join(newyork_cat), na=False)),
]
#Names of categories
choices = ["Alabama", "Alaska", "New York"]
# categorize
survey['state_category'] = np.select(conditions, choices)
I'm wondering if there is an easier way to create the conditions variable and am hoping to find an automated way to run each state_cat through (survey['state'].str.lower().str.contains('|'.join(alabama_cat), na=False))
. I need to run this process for every state and possibly territories and instances where people input other countries.
Thanks so much for any insight.
Instead of checking for every cat
, you can just try to extract any of the cat, then use map
. Something like this:
# map the codes to actual names
state_codes = {code:choice for cat,choice in zip(state_cat, choices)
for code in cat}
patt = '|'.join(state_codes.keys())
survey['state_category'] = survey['state'].str.extract(f'({patt})', expand=False).map(state_codes)