Search code examples
pandasstringlistlambdareplace

Using lambda IF condition on columns in Pandas to extract and replace strings from a data frame comparing to a list


An example df

data = {'id':[1,2,3,4,5,6,7], 'location':['Havana,Cuba', 'Santiago de Cuba' , 'La Habana', 'Cuba', 'Havana, Cuba', 'Santiago, Chile', 'Chile']}

# Create DataFrame
df = pd.DataFrame(data)
print(df)

I use the following to split the string in location column.

# Splits string into two new columns
df[['location_1', 'location_2']] = df['location'].str.split('[,|.]', 1, expand=True)

I need to further process by checking the value of LOCATION_1 against a list. If it meets one of the the STRING values from the list I need to replace the value of LOCATION_2 with the match list value and replace the LOCATION_1 value with "".

The example below (based on) is a start but needs to be conditional on a match with the list as it overwrites existing data in LOCATION_2. Another helpful solution is here but does not use a List.

sample_list=['Cuba','Chile']
df['location_2'] = df['location_1'].apply(lambda x: ''.join([L for L in sample_list if L in x]))
df['location_2'] = df['location_2'].mask(df['location_2'] == '')
print(df)

Finally, I am not trying to only extract "Cuba" and "Chile" but I also need to modify a string such as "Santiago de Cuba" to result in "Santiago de" in LOCATION_1 and "Cuba" in Location_2.

Final table should look something like this.

id          location   location_1 location_2
0   1       Havana,Cuba       Havana       Cuba
1   2  Santiago de Cuba  Santiago de       Cuba
2   3         La Habana    La Habana           
3   4              Cuba                    Cuba
4   5      Havana, Cuba       Havana       Cuba
5   6   Santiago, Chile     Santiago      Chile
6   7             Chile                   Chile

Solution

  • This should get you the answer your looking for IIUC

    data = {'id':[1,2,3,4,5,6,7], 'location':['Havana,Cuba', 'Santiago de Cuba' , 'La Habana', 'Cuba', 'Havana, Cuba', 'Santiago, Chile', 'Chile']}
    
    # Create DataFrame
    df = pd.DataFrame(data)
    sample_list=['Cuba','Chile']
    df[['location_1', 'location_2']] = df['location'].str.split('[,|.]', 1, expand=True)
    for i in range(0, len(sample_list)):
        df['location_2'] = np.where(df['location'].str.contains(sample_list[i]), sample_list[i], df['location_2'])
    df['location_2'] = df['location_2'].fillna(' ')
    df