my dataframe looks like the bottom and i want to create a new column called "remove" that would be true/false based on the following conditions.
i only want to keep one row for each "url" and want to determine which to keep based on this order:
I would like the new column to be true or false based on if it should be removed or not.
thanks in advance!!
name | age | track | score | size | url | url_count | |
---|---|---|---|---|---|---|---|
0 | timmy | 10 | gre | 0.982 | 41 | target.com | 2 |
1 | chester | 12 | tca | 0.251 | 12 | michaels.com | 3 |
2 | trixie | 10 | oke | 0.0879 | 15 | macys.com | 2 |
3 | aj | 9 | gre | 0.985 | 20 | michaels.com | 3 |
4 | cosmo | 150 | tca | 0.982 | 328 | target.com | 2 |
5 | wanda | 145 | gre | 0.879 | 550 | macys.com | 2 |
6 | jorgen | 500 | oke | 0.432 | 51 | michaels.com | 3 |
Use sort_values
with your sort criteria and a custom_sort
as the key re-order your rows with the first being the highest ranked. Then we use duplicated
on only the url
column to flag all but the first instance of a url as to be removed:
def custom_sort(x):
if x.name == 'track':
return x.map({'gre': 0, 'tca': 0, 'oke': 1})
else:
return x
df['remove'] = df.sort_values(['url', 'score', 'track', 'size'], key = custom_sort, ascending = False).duplicated(subset = 'url')