For a time series analysis, I have to drop instances that occur on the same date. However, keep some of the 'deleted' information and add it to the remaining 'duplicate' instance. Below a short example of part of my dataset.
z = pd.DataFrame({'lat':[49.125,49.125], 'lon':[-114.125 ,-114.125 ], 'time':[np.datetime64('2005-08-09'),np.datetime64('2005-08-09')], 'duration':[3,6],'size':[4,10]})
lat lon time duration size
0 49.125 -114.125 2005-08-09 3 4
1 49.125 -114.125 2005-08-09 6 10
I would like to drop the (duplicate) instance which has the lowest 'duration' value but at the same time sum the 'size' variables. Output would look like:
lat lon time duration size
0 49.125 -114.125 2005-08-09 6 14
Does anyone know how I would be able to tackle such a problem? Furthermore, for another variable, I would like to take the mean of these values. Yet I do think the process would be similar to summing the values.
edit: so far I know how to get the highest duration value to remain using:
z.sort_values(by='duration', ascending=False).drop_duplicates(subset=['lat', 'lon','time'], keep='last')
If those are all the columns in your dataframe, you can get your result using a groupby
on your time column, and passing in your aggregations for each column.
More specifically, you can drop the (duplicate) instance which has the lowest 'duration' by keeping the max()
duration, and at the same time sum the 'size' variables by using sum()
on your size column.
res = z.groupby('time').agg({'lat':'first',
'lon':'first',
'duration':'max',
'size':'sum'}). \
reset_index()
res
time lat lon duration size
0 2005-08-09 49.125 -114.125 6 14
The only difference is that 'time' is now your first column, which you can quickly fix.