I have a pandas dataframe like this one, where I can have rows with same combination of long and lat:
Initial df:
lon lat name value protection a b c score
0 20 10 canada 563 NaN cat dog elephant 20.0
1 30 10 canada 65 NaN lion tiger cat 30.0
2 40 20 canada 893 NaN dog NaN NaN 20.0
3 40 20 usa 4 NaN horse horse lion 40.0
4 45 15 usa 8593 NaN NaN lion cat 10.0
5 20 10 protection1 100 medium NaN NaN NaN NaN
6 40 20 protection1 20 high NaN NaN NaN NaN
7 50 30 protection1 500 low NaN NaN NaN NaN
but what I want is:
wanted output:
lon lat protection a b c score
0 20 10 medium cat dog elephant 20.0
1 30 10 NaN lion tiger cat 30.0
2 40 20 high horse horse lion 40.0
3 45 15 NaN NaN lion cat 10.0
4 50 30 low NaN NaN NaN NaN
The output dataframe should contain rows with a unique combination of long
and lat
columns where only the row with the highest score
is retained but if long
and lat
have duplicates and a value in the protection
column these should be join into one
Try:
df = df.sort_values(by="score", ascending=False)
g = df.groupby(["lon", "lat"])
df_out = (
g.first()
.assign(
protection=g.agg(
{"protection": lambda x: ",".join(x.dropna())}
).replace("", np.nan)
)
.reset_index()
)
print(df_out)
Prints:
lon lat name value protection a b c score
0 20 10 canada 563 medium cat dog elephant 20.0
1 30 10 canada 65 NaN lion tiger cat 30.0
2 40 20 usa 4 high horse horse lion 40.0
3 45 15 usa 8593 NaN NaN lion cat 10.0
4 50 30 protection1 500 low NaN NaN NaN NaN