Search code examples
pythonpandasmergexlookup

Python pandas merge map with multiple values xlookup


I have a dataframe of actor names:

df1

actor_id    actor_name
1   Brad Pitt
2   Nicole Kidman
3   Matthew Goode
4   Uma Thurman
5   Ethan Hawke

And another dataframe of movies that the actors were in:

df2

actor_id    actor_movie movie_revenue_m
1   Once Upon a Time in Hollywood   150
2   The Others  50
2   Moulin Rouge    200
3   Stoker  75
4   Kill Bill   125
5   Gattaca 85

I want to merge the two dataframes together to show the actors with their movie names and movie revenues, so I use the merge function:

df3 = df1.merge(df2, on = 'actor_id', how = 'left')

df3

actor_id    actor_name  actor_movie movie_revenue
1   Brad Pitt   Once Upon a Time in Hollywood   150
2   Nicole Kidman   Moulin Rouge    50
2   Nicole Kidman   The Others  200
3   Matthew Goode   Stoker  75
4   Uma Thurman Kill Bill   125
5   Ethan Hawke Gattaca 85

But this pulls in all movies, so Nicole Kidman gets duplicated, and I only want to show one movie per actor. How can I merge the dataframes without "duplicating" my list of actors?

How would I merge the movie title that is alphabetically first?

How would I merge the movie title with the highest revenue?

Thank you!


Solution

  • One way is to continue with the merge and then filter the result set

    movie title that is alphabetically first

    # sort by name, movie and then pick the first while grouping by actor
    df.sort_values(['actor_name','actor_movie'] ).groupby('actor_id', as_index=False).first()
    
        actor_id    actor_name  actor_movie     movie_revenue
    0   1   Brad Pitt   Once Upon a Time in Hollywood   150
    1   2   Nicole Kidman   Moulin Rouge    50
    2   3   Matthew Goode   Stoker  75
    3   4   Uma Thurman     Kill Bill   125
    4   5   Ethan Hawke     Gattaca     85
    

    movie title with the highest revenue

    # sort by name, and review (descending), groupby actor and pick first
    df.sort_values(['actor_name','movie_revenue'], ascending=[1,0] ).groupby('actor_id', as_index=False).first()
    
        actor_id    actor_name  actor_movie     movie_revenue
    0   1   Brad Pitt   Once Upon a Time in Hollywood   150
    1   2   Nicole Kidman   The Others  200
    2   3   Matthew Goode   Stoker  75
    3   4   Uma Thurman     Kill Bill   125
    4   5   Ethan Hawke     Gattaca     85