Search code examples
pythonpandasdata-analysis

Python Pandas Data Frame search secondary relations between rows


For example, I have a DataFrame in Python, naming people, with people_name, fav_color, best_friend_name. How should my query be so that I can add a new column that is the favorite color of their best friend, ie best_friend_fav_color?

My data is like:

{'Name' : ['John', 'Leo', 'Mary'],
 'fav_color' : ['Red', 'Blue', 'Green'],
 'Best_friend_name' : ['Leo', 'Mary', 'John']}

and the expected output would be like:

{'Name' : ['John', 'Leo', 'Mary'],
 'fav_color' : ['Red', 'Blue', 'Green'],
 'Best_friend_name' : ['Leo', 'Mary', 'John'], 
 'Best_friend_fav_color' = ['Blue', 'Green', 'Red']}

And we can assume that the best friends will always be in the data frame.


Solution

  • This is fairly easy to do:

    First copy your data to make two DataFrames:

    df2 = df[['Name', 'fav_color']].copy()
    

    Next, join the two DataFrames on the person's name and best friend's name:

    res = df.merge(df2, how='left', left_on='Best_friend_name', right_on='Name')
    

    res will contain the output you are looking for. It's columns will require renaming. You will notice that you will have fav_color_x and fav_color_y. To rename, just assign the correct names to res.columns, like so:

    res.columns = ['Name', 'fav_color', 'Best_friend_name', 'Name2', 'Best_friend_fav_color']
    

    You notice the two Name columns? You can simply drop 'Name2'.