Search code examples
pythonpandasstringdataframecomparison

Use pandas to sequentially evaluate strings on multiple columns


I have a large location dataset that has many input errors (telemarketing). I need to create a column that clearly identifies a State and that information maybe present in different columns. The following toy data illustrates the issue.

import pandas as pd
import numpy as np

df = pd.DataFrame({'State': [np.nan, np.nan, np.nan, np.nan, 
                             np.nan, 'California', np.nan, np.nan],
                   'Location': ['Seattle, Washington', 'California', 'INPUT ERROR', 'Portland, Oregon',
                              'INPUT ERROR', 'San Bernardino', 'ERROR', 'Seattle'],
                   'Origin': ['Portland, Oregon', 'San Francisco', 'RANDOM ERROR', 'INPUT ERROR',
                                   'Las Vegas, Nevada', 'Nevada, Barstow', 'Portland', 'Washington, Tacoma']})

states = ['Alaska', 'Arizona', 'California', 
          'Colorado', 'Hawaii', 'Idaho', 
          'Montana', 'Nevada', 'New Mexico', 
          'Oregon', 'Utah', 'Washington',
          'Wyoming']

I need to somehow fill the 'State' column with whichever value appears first (really does not matter!) from either the 'State' column or the 'Location' or 'Origin' columns. My thinking was to create a list of states and the compare each row and column to find a match and place the first match as the value in the 'State' column.

To achieve this i thought i would need to first split the strings in 'Location' and 'Origin' to single strings and then compare sequentially (State>Location>Origin) till a valid state is found... got stuck there.

# Split string and create new columns for comparison
df[['Location1','Location2']] = df['Location'].str.split(', ', n=1, expand=True)
df[['origin1','origin2']] = df['Origin'].str.split(', ', n=1, expand=True)

Resulting in the following tabledataframe

This is what i was trying:

# Go through each column one by one...
df.loc[df['Location2'].isin(states), 'State'] = df['Location2']
df.loc[df['origin1'].isin(states), 'State'] = df['origin1']
# Etc...

Notice that, in this case entry index 5 gets changed from California (Already in 'State' col) to Nevada (Valid State) as there is no sequence in the evaluation.

Summarizing i need to eval State, Location and Origin columns strings, sequentially to a valid state list, whichever first string is valid gets put in the 'State' column as the result.

Thanks!


Solution

  • Try to extract a state using pat in Location then Origin columns. At each time, update nan values from State column with found states using combine_first.

    pat = fr"({'|'.join(states)})"
    
    df['State'] = \
        df['State'].combine_first(df['Location'].str.extract(pat, expand=False)) \
                   .combine_first(df['Origin'].str.extract(pat, expand=False))
    

    Regular expression find

    >>> df
            State             Location              Origin
    0  Washington  Seattle, Washington    Portland, Oregon
    1  California           California       San Francisco
    2         NaN          INPUT ERROR        RANDOM ERROR
    3      Oregon     Portland, Oregon         INPUT ERROR
    4      Nevada          INPUT ERROR   Las Vegas, Nevada
    5  California       San Bernardino     Nevada, Barstow
    6         NaN                ERROR            Portland
    7  Washington              Seattle  Washington, Tacoma
    
    >>> pat
    '(Alaska|Arizona|California|Colorado|Hawaii|Idaho|Montana|Nevada|New Mexico|Oregon|Utah|Washington|Wyoming)'