python pandas dataframe series

Check if a Series is already in a DataFrame


Let's say you have some students:

import pandas as pd

students = [['Jack', 34, 'Sydney'],
            ['Riti', 30, 'Delhi'],
            ['Aadi', 16, 'New York']]
dfObj = pd.DataFrame(students, columns=['Name', 'Age', 'City'])

And now you receive a series like this:

s = pd.Series(['Riti', 30, 'Delhi'], index=['Name', 'Age', 'City'])

I could now use .loc to filter for the criteria like this:

filtered_dfObj = dfObj.loc[(dfObj['Name'] == s['Name']) & (dfObj['Age'] == s['Age'])]
filtered_dfObj = filtered_dfObj.loc[filtered_dfObj['City'] == s['City']]

But with a lot of columns, the filter code would grow very quickly. Ideally there would be an option like s.isin(dfObj).


Update after 5 answers: These are all good answers, thanks! I have not run any speed tests comparing the different approaches yet. I personally go with this solution because it is the most flexible regarding column selection (if that is needed).


Solution

  • Consider the following approach:

    # number of full duplicates (rows)
    print((dfObj == s).all(axis=1).sum())
    

    If you want to check only some columns, you can filter by column names like this:

    flt = ['Name', 'Age']
    # number of partial duplicates (rows)
    print((dfObj[flt] == s[flt]).all(axis=1).sum())
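
If you need a yes/no answer rather than a count, the same comparison can be wrapped in a small helper. The following is only a sketch: the function name `series_in_frame`, its `cols` parameter, and the second Series `other` are made up for illustration and are not part of pandas. It assumes every label in the Series (or in `cols`) exists as a column of the DataFrame. Selecting the same labels on both sides keeps the DataFrame columns and the Series index identical, which the `==` comparison relies on.

```python
import pandas as pd


def series_in_frame(df, s, cols=None):
    """Return True if at least one row of df equals s on the chosen columns.

    Hypothetical helper, not a pandas API. cols defaults to all labels of s.
    """
    cols = list(s.index) if cols is None else list(cols)
    # Compare every row with s column by column, then require a full match
    # within a row (all) in at least one row (any).
    return bool((df[cols] == s[cols]).all(axis=1).any())


students = [['Jack', 34, 'Sydney'],
            ['Riti', 30, 'Delhi'],
            ['Aadi', 16, 'New York']]
dfObj = pd.DataFrame(students, columns=['Name', 'Age', 'City'])

s = pd.Series(['Riti', 30, 'Delhi'], index=['Name', 'Age', 'City'])
# Made-up Series that differs from the 'Riti' row only in Age
other = pd.Series(['Riti', 31, 'Delhi'], index=['Name', 'Age', 'City'])

full_match = series_in_frame(dfObj, s)                                # all columns
partial_match = series_in_frame(dfObj, other, cols=['Name', 'City'])  # subset only
no_match = series_in_frame(dfObj, other)                              # Age differs

# Index labels of the rows that match s on every column
matching_labels = dfObj.index[(dfObj == s).all(axis=1)].tolist()
```

Here `.any()` replaces `.sum()` because only the existence of a match matters; keep `.sum()` if you want the number of matching rows.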