
Python Dataframe: Dropping duplicates based on certain conditions


I have a dataframe with duplicate Shop IDs, where some Shop IDs occur twice and some occur three times.
I only want to keep each unique Shop ID based on the shortest Shop Distance assigned to its Area.

        Area  Shop Name  Shop Distance  Shop ID

    0   AAA   Ly         86             5d87790c46a77300
    1   AAA   Hi         230            5ce5522012138400
    2   BBB   Hi         780            5ce5522012138400
    3   CCC   Ly         450            5d87790c46a77300
    ...
    91  MMM   Ju         43             4f76d0c0e4b01af7
    92  MMM   Hi         1150           5ce5522012138400
    ...

Using pandas drop_duplicates drops the duplicate rows, but keep='first'/'last' keeps whichever Shop ID occurs first or last by position, which does not let me choose by distance:

shops_df = shops_df.drop_duplicates(subset='Shop ID', keep='first')
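For example, on a toy dataframe (made-up values, not my real data), the first occurrence wins even when it is farther away:

```python
import pandas as pd

# Made-up data: shop 's1' appears twice, with the farther row first.
toy = pd.DataFrame({
    'Shop ID': ['s1', 's2', 's1'],
    'Shop Distance': [230, 86, 43],
})

kept = toy.drop_duplicates(subset='Shop ID', keep='first')
# keep='first' keeps the 230 row for 's1', even though 43 is shorter.
print(kept)
```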

I also tried to group by Shop ID and then sort, but the sort raised a duplicates error:

bbtshops_new['C'] = bbtshops_new.groupby('Shop ID')['Shop ID'].cumcount()
bbtshops_new.sort_values(by=['C'], axis=1)  # axis=1 sorts columns, not rows
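For what it's worth, the error here most likely comes from `axis=1`, which tries to sort the *columns* using a row labeled `'C'`, rather than sorting the rows by the `'C'` column. A minimal sketch with made-up data, showing the row sort working under the default `axis=0`:

```python
import pandas as pd

# Made-up data for illustration only.
df = pd.DataFrame({
    'Shop ID': ['s1', 's2', 's1'],
    'Shop Distance': [230, 86, 43],
})

# cumcount numbers each occurrence of a Shop ID: 0, 1, 2, ...
df['C'] = df.groupby('Shop ID')['Shop ID'].cumcount()

# axis=0 (the default) sorts rows by the 'C' column without error.
df = df.sort_values(by=['C'])
print(df)
```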

So far I have gotten up to this stage:

# filter all the duplicate rows into a new df
df_toclean = shops_df[shops_df['Shop ID'].duplicated(keep=False)]

# count how many times each duplicated Shop ID occurs
mask = df_toclean['Shop ID'].value_counts()

# Shop IDs that occurred 2 times
shop_2 = mask[mask == 2].index

# Shop IDs that occurred 3 times
shop_3 = mask[mask == 3].index

# boolean mask for the shops within a 750 radius
dist_1 = df_toclean['Shop Distance'] <= 750

# all the Shop IDs that appeared twice and are within a 750 radius
bbtshops_2 = df_toclean[dist_1 & df_toclean['Shop ID'].isin(shop_2)]

* If I use df_toclean['Shop Distance'].min() instead of dist_1, it returns 0 results

I think I'm doing it the long way and still haven't figured out how to drop the duplicates. Does anyone know a shorter way to solve this? I'm new to Python, thanks for helping out!


Solution

  • First sort the dataframe by distance, then drop the duplicate shops.

    df = shops_df.sort_values('Shop Distance')
    df = df[~df['Shop ID'].duplicated()]  # The tilde (~) inverts the boolean mask.
    

    Or as a single chained expression (per the comment from @chmielcode):

    df = (
        shops_df
        .sort_values('Shop Distance')
        .drop_duplicates(subset='Shop ID', keep='first')
        .reset_index(drop=True)  # Optional.
    )
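    A self-contained sketch of the sort-then-dedupe approach, using made-up sample data that mirrors the question's columns:

```python
import pandas as pd

# Made-up sample data with the question's column names.
shops_df = pd.DataFrame({
    'Area': ['AAA', 'AAA', 'BBB', 'CCC'],
    'Shop Name': ['Ly', 'Hi', 'Hi', 'Ly'],
    'Shop Distance': [86, 230, 780, 450],
    'Shop ID': ['id1', 'id2', 'id2', 'id1'],
})

df = (
    shops_df
    .sort_values('Shop Distance')        # closest rows come first
    .drop_duplicates(subset='Shop ID')   # keep='first' is the default
    .reset_index(drop=True)
)
print(df)
```

    An equivalent alternative that avoids sorting the whole frame is to select the row index of each group's minimum distance: `shops_df.loc[shops_df.groupby('Shop ID')['Shop Distance'].idxmin()]`.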