Search code examples
pythonpandas

Python Pandas - Identify pairs in a dataframe based on both a string similarity in 2 columns AND a difference in another column


I hope I am explaining this correctly. I have a dataframe in which i need to identify pairs of rows based on the string value of two columns. Each row in the pair must have a different string value in another column. I then need to add a new value to a new column based on TRUE or FALSE condition of that third column AND the condition of the pairing.

For example. A simplified version of the df would be: Before

The end result would look like this: After, with new_col

Any help would be greatly appreciated.


Solution

  • Hoping this helps, using data similar to the examples you shared.

    data = {
        'name': ['John', 'John', 'Jane', 'Jane', 'Doe', 'Doe'],
        'city': ['LA', 'LA', 'SF', 'SF', 'SD', 'SD'],
        'item': ['Peanut Butter', 'Jelly', 'Peanut Butter', 'Peanut Butter', 
        'Jelly', 'Jelly']
    }
    
    df = pd.DataFrame(data)
    

    We can then create a dictionary of the string pairs from name and city with the values being a list of the items associated with the pairs.

    unique_dict = df.groupby(['name', 'city'])['item'].apply(list).to_dict()
    

    Once we have this we define a function to handle the logic of what each pair needs and can apply it to the dataframe.

    def determine_needs(items):
        if 'Peanut Butter' in items and 'Jelly' in items:
            return None
        elif 'Peanut Butter' in items:
            return 'Jelly'
        elif 'Jelly' in items:
            return 'Peanut Butter'
        else:
            return None
    
    df['needs'] = df.apply(lambda row: determine_needs(unique_dict[(row['name'], row['city'])]), axis=1)