python pandas dataframe contains difflib

Iterate over each pandas DataFrame row, check whether the row value is in a list, and if so pull that value into the DataFrame


I have a pandas DataFrame with hand-entered values for states around the world. I also have a list of state names that are properly formatted and spelled correctly. I want to iterate through each row of the DataFrame and compare its value against every value in the list of states to determine whether the row value is contained within any of those strings. If so, the matching value from the list should be pulled into a new column called "match". If the row value is contained in more than one string, all of the matching values should be brought in as a list. Below is an example of what I mean.

Note: I can already do this with difflib's get_close_matches function. I have posted that code and its output below. I want a way to replicate it, but with the substring-matching behaviour of str.contains() in pandas.

state_df

states_list = ['Oregon', 'Texas', 'Colorado', 'Hawaii', 'Sonora', 'Alaska', 'Alabama', 'Accra']  # etc.

Outcome

[image: example of the expected outcome]

Below is how I use get_close_matches to select the closest matches to the hand-entered state values. I want an additional column containing the values from the states list in which the row's string is contained.

[image: get_close_matches code and its output]
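
The get_close_matches code in the question was posted as an image, so the following is only a minimal sketch of what that approach might look like. The sample rows, the `close_matches` helper, the `close_match` column name, and the `n` and `cutoff` values are all assumptions for illustration, not the asker's actual code.

```python
from difflib import get_close_matches

import pandas as pd

# Hypothetical sample data; the original table was only available as an image.
states_list = ['Oregon', 'Texas', 'Colorado', 'Hawaii', 'Sonora', 'Alaska', 'Alabama', 'Accra']
state_df = pd.DataFrame({'state_name': ['oregon', 'Texs', 'Ala', 'Hawai']})

# Map lowercase names to their canonical spelling so matching ignores case.
lookup = {s.lower(): s for s in states_list}


def close_matches(value, n=3, cutoff=0.6):
    """Return up to n canonical state names that are similar to value."""
    hits = get_close_matches(value.strip().lower(), list(lookup), n=n, cutoff=cutoff)
    return [lookup[h] for h in hits]


# Hypothetical column name holding the fuzzy matches for each row.
state_df['close_match'] = state_df['state_name'].apply(close_matches)
print(state_df)
```

Note that get_close_matches scores overall similarity rather than testing for a substring, which is why a separate "contained in" column is being asked for.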


Solution

  • Try the following:

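    # Lowercase set of the canonical state names, for case-insensitive lookup.
    s = {i.lower() for i in states_list}

    # Split each entry on commas, normalise each piece, and keep the pieces that are known states.
    df['match'] = df['state_name'].apply(
        lambda x: list({i.strip().lower() for i in x.split(',')}.intersection(s))
    )

    # Restore the capital first letter of each match.
    df['match'] = df['match'].apply(lambda x: [i[0].upper() + i[1:] for i in x])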
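
The snippet above keeps only the comma-separated pieces that exactly equal a state name. If what is wanted is closer to str.contains(), where a row value such as "ala" should pull in every state name that contains it, a sketch along the following lines might work. The sample rows and the `containing_states` helper are hypothetical, and the `state_name` column is taken from the answer above.

```python
import pandas as pd

states_list = ['Oregon', 'Texas', 'Colorado', 'Hawaii', 'Sonora', 'Alaska', 'Alabama', 'Accra']

# Hypothetical hand-entered values; the original data was only shown as an image.
df = pd.DataFrame({'state_name': ['ala', 'Texas', 'or', 'xyz']})


def containing_states(value, candidates=states_list):
    """Return every candidate whose name contains value, ignoring case."""
    needle = value.strip().lower()
    if not needle:
        # An empty string is a substring of everything, so treat it as no match.
        return []
    return [state for state in candidates if needle in state.lower()]


# Each row gets a list, which may be empty or hold several state names.
df['match'] = df['state_name'].apply(containing_states)
print(df)
```

This is a plain Python loop per row, so for a very large list of states it may be slow. Matches are returned in the order they appear in `states_list`.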