
Finding common rows (intersection) in two Pandas dataframes


Assume I have two dataframes of this format (call them df1 and df2):

+------------------------+------------------------+--------+
|        user_id         |      business_id       | rating |
+------------------------+------------------------+--------+
| rLtl8ZkDX5vH5nAx9C3q5Q | eIxSLxzIlfExI6vgAbn2JA |      4 |
| C6IOtaaYdLIT5fWd7ZYIuA | eIxSLxzIlfExI6vgAbn2JA |      5 |
| mlBC3pN9GXlUUfQi1qBBZA | KoIRdcIfh3XWxiCeV1BDmA |      3 |
+------------------------+------------------------+--------+

I'm looking to get a dataframe of all the rows that have a common user_id in df1 and df2 (i.e., if a user_id is in both df1 and df2, include both rows in the output dataframe).

I can think of many ways to approach this, but they all strike me as clunky. For example, we could find all the unique user_ids in each dataframe, create a set of each, find their intersection, filter the two dataframes with the resulting set and concatenate the two filtered dataframes.
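For reference, a minimal sketch of that set-based approach might look like the following. The sample rows and the names common_ids and common_rows are made up for illustration and are not part of the real data.

```python
import pandas as pd

# Hypothetical sample data in the same three-column format as above
df1 = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "business_id": ["b1", "b1", "b2"],
    "rating": [4, 5, 3],
})
df2 = pd.DataFrame({
    "user_id": ["u2", "u3", "u4"],
    "business_id": ["b3", "b4", "b5"],
    "rating": [2, 1, 5],
})

# user_ids that appear in both dataframes
common_ids = set(df1["user_id"]) & set(df2["user_id"])

# Filter each dataframe to the shared user_ids, then stack the results
common_rows = pd.concat(
    [
        df1[df1["user_id"].isin(common_ids)],
        df2[df2["user_id"].isin(common_ids)],
    ],
    ignore_index=True,
)
print(common_rows)
```

The result keeps the original three columns and contains one row per matching row in either dataframe.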

Maybe that's the best approach, but I know Pandas is clever. Is there a simpler way to do this? I've looked at merge but I don't think that's what I need.


Solution

  • My understanding is that this question is better answered over in this post.

    But briefly, the answer to the OP with this method is simply:

    import pandas as pd
    s1 = pd.merge(df1, df2, how='inner', on=['user_id'])
    

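    This gives s1 with 5 columns: user_id plus the other two columns from each of df1 and df2. Because business_id and rating exist in both dataframes, pandas adds its default suffixes, so they appear as business_id_x, rating_x, business_id_y and rating_y.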
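To make the shape of the merged result concrete, here is a small sketch using made-up rows (the ids and ratings are hypothetical). It assumes only the column names from the question.

```python
import pandas as pd

# Hypothetical data: u2 appears twice in df1 and once in df2
df1 = pd.DataFrame({
    "user_id": ["u1", "u2", "u2"],
    "business_id": ["b1", "b1", "b2"],
    "rating": [4, 5, 3],
})
df2 = pd.DataFrame({
    "user_id": ["u2", "u3"],
    "business_id": ["b3", "b4"],
    "rating": [2, 1],
})

# Inner join: keeps only user_ids found in both dataframes
s1 = pd.merge(df1, df2, how="inner", on=["user_id"])

print(list(s1.columns))
# ['user_id', 'business_id_x', 'rating_x', 'business_id_y', 'rating_y']
print(len(s1))
# 2
```

Note that the merge places matching rows side by side and produces one output row per matching pair, so a user_id that repeats in both dataframes yields every combination. If the goal is to stack the original rows instead, filtering each dataframe with isin and concatenating may be a closer fit.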