Search code examples
python-3.xpandasdataframegroup-by

Delete the rows with same entires corresponding to specific groups in Python


I have the following dataset:

#Load the required libraries
import pandas as pd


#Create dataset
data = {'id': [1, 1, 1, 1, 1,1, 1, 1, 1, 1,1, 1, 1, 1, 1,
               2, 2, 2, 2, 2, 2,2,2,2,2,
               3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
        'Salary': [7, 7, 7, 7, 7,7,7,8,9,10,11,12,13,14,15,
                   4, 4, 4,4,5,6,7,8,9,10,
                   8, 8, 8, 8,8,8,8,8,9,10,11,12,13],
        'Children': ['No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'No','No', 'Yes', 'Yes', 'Yes', 'No',
                     'Yes', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No',
                     'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'No','Yes', 'Yes', 'No','No', 'Yes'],
        'Days': [123, 128, 66, 120, 141,123, 128, 66, 120, 141, 52,96, 120, 141, 52,
                 96, 120,120, 141, 52,96,128, 66, 120, 141,
                 120, 141,123,15,85,36,58,89,123, 128, 66, 120, 141],
        }

#Convert to dataframe
df = pd.DataFrame(data)
print("df = \n", df)

The above dataset looks as such:

enter image description here

I wish to delete the rows with same entries from 'Salary' with respect to specific group/id.

For example,

For id = 1, delete rows with 'Salary'=7, except the last entry.

For id = 2, delete rows with 'Salary'=4, except the last entry.

For id = 3, delete rows with 'Salary'=8, except the last entry.

The new dataset should look as such:

enter image description here

Can somebody please let me know hot to achieve this task in Python?


Solution

  • Use boolean indexing with chain by & for bitwise AND and | for bitwise OR with DataFrame.duplicated:

    mask = ((df['id'].ne(1) & df['Salary'].ne(7) &
             df['id'].ne(2) & df['Salary'].ne(4) &
             df['id'].ne(3) & df['Salary'].ne(8)) | 
             ~df.duplicated(['id','Salary'], keep='last'))
    
    df = df[mask]
    print("df = \n", df)
         id  Salary Children  Days
    6    1       7      Yes   128
    7    1       8      Yes    66
    8    1       9      Yes   120
    9    1      10       No   141
    10   1      11       No    52
    11   1      12      Yes    96
    12   1      13      Yes   120
    13   1      14      Yes   141
    14   1      15       No    52
    18   2       4      Yes   141
    19   2       5      Yes    52
    20   2       6      Yes    96
    21   2       7      Yes   128
    22   2       8      Yes    66
    23   2       9      Yes   120
    24   2      10       No   141
    32   3       8       No    89
    33   3       9      Yes   123
    34   3      10      Yes   128
    35   3      11       No    66
    36   3      12       No   120
    37   3      13      Yes   141
    

    EDIT: If need only removed all duplicates with keep last rows per id,Salary columns use DataFrame.drop_duplicates :

    df = df.drop_duplicates(subset=['id','Salary'], keep='last')