Search code examples
pythonsortingconditional-statementsgrouping

create new df column based on grouping and conditions in PYTHON


my dataframe looks like the bottom and i want to create a new column called "remove" that would be true/false based on the following conditions.

i only want to keep one row for each "url" and want to determine which to keep based on this order:

  • keep the one with a higher score
  • if score is same then go to track "gre" and "tca" are favored over "oke"
  • if they are equal at the track then go to "size" and take the bigger one

I would like the new column to be true or false based on if it should be removed or not.

thanks in advance!!

name age track score size url url_count
0 timmy 10 gre 0.982 41 target.com 2
1 chester 12 tca 0.251 12 michaels.com 3
2 trixie 10 oke 0.0879 15 macys.com 2
3 aj 9 gre 0.985 20 michaels.com 3
4 cosmo 150 tca 0.982 328 target.com 2
5 wanda 145 gre 0.879 550 macys.com 2
6 jorgen 500 oke 0.432 51 michaels.com 3

Solution

  • Use sort_values with your sort criteria and a custom_sort as the key re-order your rows with the first being the highest ranked. Then we use duplicated on only the url column to flag all but the first instance of a url as to be removed:

    def custom_sort(x):
        if x.name == 'track':
            return x.map({'gre': 0, 'tca': 0, 'oke': 1})
        
        else:
            return x
    
    df['remove'] = df.sort_values(['url', 'score', 'track', 'size'], key = custom_sort, ascending = False).duplicated(subset = 'url')