Search code examples
pythonregexdictionarytextpython-re

python sub state names for abbrev via python dict with re.sub


I have a dataframe with a column in it containing state names. The names are a mix of official abbreviations and partial spellings and complete state names.

d = pd.DataFrame(['fla', 'fl', 'del', 'ohio', 'calif', 'ca', 'del', 'texas', 'miss', 'tx', 'new mex'],
                 columns = ["state"])

There is a python dict with state abbrevs and names here: https://code.activestate.com/recipes/577305-python-dictionary-of-us-states-and-territories/

I would like to look in the dataframe d and find the best match in the dict and substitute for the values in d['state']. I don't think I want to use replace because i want to replace the "whole word" rather than the substring. The desired result:

d = ['fl', 'fl', 'de', 'oh', 'ca', 'ca', 'de', 'tx', 'ms', 'tx', 'nm']

Loading the dict directly into my console, and calling it states_dict, I tried the following (based on this map US state name to two letter acronyms that was given in dictionary separately)

d['state'] = d['state'].map(states_dict)

which produced nan for every entry in my dataframe, d.

Any help would be much appreciated.

Thanks.


Solution

  • This seems to work: search for the state name by taking each value from d['state'] and adding \w* between every letter and use match to search from the beginning of each dictionary value in a case insensitive way. Once found, return the found value lowercased.

    And I think miss must be mo, not mx.

    import pandas as pd
    import re
    d = pd.DataFrame(['fla', 'fl', 'del', 'ohio', 'calif', 'ca', 'del', 'texas', 'miss', 'tx', 'new mex', 'NY', 'NJ', 'NM', 'NC'], columns = ["state"])
    states = {
            'AK': 'Alaska',
            'AL': 'Alabama',
            'AR': 'Arkansas',
            'AS': 'American Samoa',
            'AZ': 'Arizona',
            'CA': 'California',
            'CO': 'Colorado',
            'CT': 'Connecticut',
            'DC': 'District of Columbia',
            'DE': 'Delaware',
            'FL': 'Florida',
            'GA': 'Georgia',
            'GU': 'Guam',
            'HI': 'Hawaii',
            'IA': 'Iowa',
            'ID': 'Idaho',
            'IL': 'Illinois',
            'IN': 'Indiana',
            'KS': 'Kansas',
            'KY': 'Kentucky',
            'LA': 'Louisiana',
            'MA': 'Massachusetts',
            'MD': 'Maryland',
            'ME': 'Maine',
            'MI': 'Michigan',
            'MN': 'Minnesota',
            'MO': 'Missouri',
            'MP': 'Northern Mariana Islands',
            'MS': 'Mississippi',
            'MT': 'Montana',
            'NA': 'National',
            'NC': 'North Carolina',
            'ND': 'North Dakota',
            'NE': 'Nebraska',
            'NH': 'New Hampshire',
            'NJ': 'New Jersey',
            'NM': 'New Mexico',
            'NV': 'Nevada',
            'NY': 'New York',
            'OH': 'Ohio',
            'OK': 'Oklahoma',
            'OR': 'Oregon',
            'PA': 'Pennsylvania',
            'PR': 'Puerto Rico',
            'RI': 'Rhode Island',
            'SC': 'South Carolina',
            'SD': 'South Dakota',
            'TN': 'Tennessee',
            'TX': 'Texas',
            'UT': 'Utah',
            'VA': 'Virginia',
            'VI': 'Virgin Islands',
            'VT': 'Vermont',
            'WA': 'Washington',
            'WI': 'Wisconsin',
            'WV': 'West Virginia',
            'WY': 'Wyoming'
    }
    
    def best_match(x):
        if len(x) == 2: # Try another way for 2-letter codes
            for a,n in states.items():
                if len(n.split()) == 2:
                    if "".join([c[0] for c in n.split()]).lower() == x.lower():
                        return a.lower()
        new_rx = re.compile(r"\w*".join([ch for ch in x]), re.I)
        for a,n in states.items():
            if new_rx.match(n):
                return a.lower()
            
    d['state_corrected'] = d['state'].apply(lambda x: best_match(x))
    

    Results:

          state state_corrected
    0       fla              fl
    1        fl              fl
    2       del              de
    3      ohio              oh
    4     calif              ca
    5        ca              ca
    6       del              de
    7     texas              tx
    8      miss              mo
    9        tx              tx
    10  new mex              nm
    11       NY              ny
    12       NJ              nj
    13       NM              nm
    14       NC              nc