I have a large location dataset with many input errors (telemarketing data). I need to create a column that clearly identifies a state, and that information may be present in different columns. The following toy data illustrates the issue.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'State': [np.nan, np.nan, np.nan, np.nan,
              np.nan, 'California', np.nan, np.nan],
    'Location': ['Seattle, Washington', 'California', 'INPUT ERROR',
                 'Portland, Oregon', 'INPUT ERROR', 'San Bernardino',
                 'ERROR', 'Seattle'],
    'Origin': ['Portland, Oregon', 'San Francisco', 'RANDOM ERROR',
               'INPUT ERROR', 'Las Vegas, Nevada', 'Nevada, Barstow',
               'Portland', 'Washington, Tacoma']})
states = ['Alaska', 'Arizona', 'California', 'Colorado', 'Hawaii', 'Idaho',
          'Montana', 'Nevada', 'New Mexico', 'Oregon', 'Utah', 'Washington',
          'Wyoming']
I need to somehow fill the 'State' column with whichever valid value appears first (the order really does not matter!) from the 'State', 'Location', or 'Origin' columns. My thinking was to create a list of states, then compare each row and column to find a match, and place the first match as the value in the 'State' column.
To achieve this I thought I would first need to split the strings in 'Location' and 'Origin' into single strings and then compare sequentially (State > Location > Origin) until a valid state is found... I got stuck there.
# Split string and create new columns for comparison
df[['Location1','Location2']] = df['Location'].str.split(', ', n=1, expand=True)
df[['origin1','origin2']] = df['Origin'].str.split(', ', n=1, expand=True)
Resulting in the following table.
This is what I was trying:
# Go through each column one by one...
df.loc[df['Location2'].isin(states), 'State'] = df['Location2']
df.loc[df['origin1'].isin(states), 'State'] = df['origin1']
# Etc...
Notice that, in this case, the entry at index 5 gets changed from California (already in the 'State' column) to Nevada (also a valid state), because there is no sequence in the evaluation.
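One way to preserve the sequence, as a sketch against the toy data above: mask each assignment so a row is only filled while its 'State' is still NaN. That way, earlier matches (and pre-existing values like index 5's California) are never overwritten.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'State': [np.nan, np.nan, np.nan, np.nan,
              np.nan, 'California', np.nan, np.nan],
    'Location': ['Seattle, Washington', 'California', 'INPUT ERROR',
                 'Portland, Oregon', 'INPUT ERROR', 'San Bernardino',
                 'ERROR', 'Seattle'],
    'Origin': ['Portland, Oregon', 'San Francisco', 'RANDOM ERROR',
               'INPUT ERROR', 'Las Vegas, Nevada', 'Nevada, Barstow',
               'Portland', 'Washington, Tacoma']})
states = ['Alaska', 'Arizona', 'California', 'Colorado', 'Hawaii', 'Idaho',
          'Montana', 'Nevada', 'New Mexico', 'Oregon', 'Utah', 'Washington',
          'Wyoming']

# Split strings and create new columns for comparison, as in the question
df[['Location1', 'Location2']] = df['Location'].str.split(', ', n=1, expand=True)
df[['origin1', 'origin2']] = df['Origin'].str.split(', ', n=1, expand=True)

# Walk the candidate columns in priority order; only fill rows whose
# 'State' is still NaN, so the first valid match wins.
for col in ['Location1', 'Location2', 'origin1', 'origin2']:
    mask = df['State'].isna() & df[col].isin(states)
    df.loc[mask, 'State'] = df[col]
```

With this masking, index 5 keeps California and index 4 picks up Nevada from 'origin2'.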
Summarizing: I need to evaluate the strings in the State, Location, and Origin columns sequentially against a valid-state list; whichever string is valid first gets put in the 'State' column as the result.
Thanks!
Try to extract a state using pat from the Location and then the Origin columns. Each time, update NaN values in the State column with the found states using combine_first.
pat = fr"({'|'.join(states)})"
df['State'] = (
    df['State']
    .combine_first(df['Location'].str.extract(pat, expand=False))
    .combine_first(df['Origin'].str.extract(pat, expand=False))
)
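One caveat worth hedging: the alternation pattern as written matches a state name anywhere in the string, including inside a longer word. If your real data can contain tokens like the hypothetical 'Washingtonville' below, wrapping the group in `\b` word boundaries avoids false matches:

```python
import pandas as pd

# Subset of states for illustration
states = ['Nevada', 'Oregon', 'Washington']
pat = fr"\b({'|'.join(states)})\b"

# 'Washingtonville' is a made-up token to show the boundary effect
s = pd.Series(['Washingtonville, Ohio', 'Portland, Oregon'])
extracted = s.str.extract(pat, expand=False)
```

Without `\b`, the first row would incorrectly extract 'Washington'; with the boundaries it stays NaN.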
>>> df
State Location Origin
0 Washington Seattle, Washington Portland, Oregon
1 California California San Francisco
2 NaN INPUT ERROR RANDOM ERROR
3 Oregon Portland, Oregon INPUT ERROR
4 Nevada INPUT ERROR Las Vegas, Nevada
5 California San Bernardino Nevada, Barstow
6 NaN ERROR Portland
7 Washington Seattle Washington, Tacoma
>>> pat
'(Alaska|Arizona|California|Colorado|Hawaii|Idaho|Montana|Nevada|New Mexico|Oregon|Utah|Washington|Wyoming)'
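An equivalent sketch, assuming the same toy data: instead of chaining combine_first, extract a state from every column at once and back-fill across each row, so the first non-missing match per row (in State > Location > Origin order) is taken:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'State': [np.nan, np.nan, np.nan, np.nan,
              np.nan, 'California', np.nan, np.nan],
    'Location': ['Seattle, Washington', 'California', 'INPUT ERROR',
                 'Portland, Oregon', 'INPUT ERROR', 'San Bernardino',
                 'ERROR', 'Seattle'],
    'Origin': ['Portland, Oregon', 'San Francisco', 'RANDOM ERROR',
               'INPUT ERROR', 'Las Vegas, Nevada', 'Nevada, Barstow',
               'Portland', 'Washington, Tacoma']})
states = ['Alaska', 'Arizona', 'California', 'Colorado', 'Hawaii', 'Idaho',
          'Montana', 'Nevada', 'New Mexico', 'Oregon', 'Utah', 'Washington',
          'Wyoming']

pat = fr"({'|'.join(states)})"

# Extract a state (or NaN) from each text column, then take the
# first match per row by back-filling along the columns.
found = df[['State', 'Location', 'Origin']].apply(
    lambda col: col.str.extract(pat, expand=False))
df['State'] = found.bfill(axis=1).iloc[:, 0]
```

This scales more easily if additional candidate columns are added later, since only the column list changes.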