Search code examples
pythonpandasdataframedrop-duplicates

Summing values of (dropped) duplicate rows Pandas DataFrame


For a time series analysis, I have to drop instances that occur on the same date. However, keep some of the 'deleted' information and add it to the remaining 'duplicate' instance. Below a short example of part of my dataset.

z = pd.DataFrame({'lat':[49.125,49.125], 'lon':[-114.125 ,-114.125 ], 'time':[np.datetime64('2005-08-09'),np.datetime64('2005-08-09')], 'duration':[3,6],'size':[4,10]})

    lat     lon         time        duration    size
0   49.125  -114.125    2005-08-09  3           4
1   49.125  -114.125    2005-08-09  6           10

I would like to drop the (duplicate) instance which has the lowest 'duration' value but at the same time sum the 'size' variables. Output would look like:

    lat     lon         time        duration    size
0   49.125  -114.125    2005-08-09  6           14

Does anyone know how I would be able to tackle such a problem? Furthermore, for another variable, I would like to take the mean of these values. Yet I do think the process would be similar to summing the values.

edit: so far I know how to get the highest duration value to remain using:

z.sort_values(by='duration', ascending=False).drop_duplicates(subset=['lat', 'lon','time'], keep='last')

Solution

  • If those are all the columns in your dataframe, you can get your result using a groupbyon your time column, and passing in your aggregations for each column.

    More specifically, you can drop the (duplicate) instance which has the lowest 'duration' by keeping the max() duration, and at the same time sum the 'size' variables by using sum() on your size column.

    res = z.groupby('time').agg({'lat':'first',
                                               'lon':'first',
                                               'duration':'max',
                                               'size':'sum'}).  \
        reset_index()
    

    res
    
            time     lat      lon  duration  size
    0 2005-08-09  49.125 -114.125         6    14
    

    The only difference is that 'time' is now your first column, which you can quickly fix.