Search code examples
pythonpandasfor-loopif-statementsum

Python: Aggregate dataframe values according to condition depending on series sum


I have the following dummy data:

data = {"trip_id": ["a", "a", "a", "a", "b", "b", "b", "c", "c", "c", "c", "c"],
        "distance": [5, 10, 2, 12, 20, 1, 15, 2, 3, 5, 0, 28],
        "value": [32, 46, 132, 45, 68, 123, 56, 54, 32, 89, 10, 39],
        "timestamp": ["2021-05-09 10:12:18", "2021-06-19 08:22:18", "2021-01-09 15:20:00", "2021-01-09 18:30:18",
                "2021-04-09 24:11:18", "2021-02-19 08:12:18", "2021-06-11 08:56:18", "2021-06-11 08:54:18",
                "2021-06-09 18:12:18", "2021-06-19 16:10:18", "2021-06-03 18:12:18", "2021-06-12 08:22:18"]
       }

df = pd.DataFrame(data, columns = ["trip_id", "distance", "value", "timestamp"])
df

>>
trip_id distance    value   timestamp
0   a   5   32  2021-05-09 10:12:18
1   a   10  46  2021-06-19 08:22:18
2   a   2   132 2021-01-09 15:20:00
3   a   12  45  2021-01-09 18:30:18
4   b   20  68  2021-04-09 24:11:18
5   b   1   123 2021-02-19 08:12:18
6   b   15  56  2021-06-11 08:56:18
7   c   2   54  2021-06-11 08:54:18
8   c   3   32  2021-06-09 18:12:18
9   c   5   89  2021-06-19 16:10:18
10  c   0   10  2021-06-03 18:12:18
11  c   28  39  2021-06-12 08:22:18

I would like to compact the values per trip_id every 30m of the distance column. For this, I am trying:

current_trip_id = ""
dist_segment = []

for index,row in df.iterrows():
    
    if row["trip_id"] == current_trip_id:
        
        if sum(dist_segment) <= 30:
            # compute trip characteristics every 30m
            sorted_trip_char = df.sort_values(["trip_id", "timestamp"])
            grouped_trip_char = sorted_trip_char.groupby("trip_id")
            
            agg_trip_char = grouped_trip_char.agg({'trip_id': 'first',
                                       'distance':'sum',
                                       'value': 'mean',
                                       'timestamp':['first', 'last']
                                      })
            
        else:
            # store trip characteristics to another line up to 30m
            print("More than 30m in index", index)
            
            
    
    # Update loop
    current_trip_id = row["trip_id"]
    dist_segment.append(row["distance"])

However, I can't see how I can still preserve the id and jump into a new aggregation of values.

This is the output I'm after:

agg_trip_char
>>

          trip_id   distance    value      timestamp
          first     sum         mean       first                  last
trip_id                 
a         a         29         63.75        2021-01-09 15:20:00   2021-06-19 08:22:18
b         b         21         95.50        2021-02-19 08:12:18   2021-06-11 08:56:18
b         b         15         56           2021-06-11 08:56:18   2021-06-11 08:56:18
c         c         10         46.25        2021-06-11 08:54:18   2021-06-03 18:12:18
c         c         28         39           2021-06-12 08:22:18   2021-06-12 08:22:18

Solution

  • You can modify your codes as follows:

    1. sort the columns
    df_out = df.sort_values(["trip_id", "timestamp"])
    
    1. Create groups of every 30m distance within the same trip_id
    g = df_out.groupby('trip_id')['distance'].cumsum() // 30
    

    Here, we get the cummulative sum of distance within a particular trip_id by using .groupby() + .cumsum(). Then, use integer division by 30 to divide every 30m into different group numbers.

    1. Aggregate the columns
    df_out2 = (df_out.groupby(['trip_id', g])
                     .agg({'trip_id': 'first',
                           'distance':'sum',
                           'value': 'mean',
                           'timestamp':['first', 'last']
                          }).droplevel(-1)
              )                 
    

    Result:

    print(df_out2)
    
            trip_id distance  value            timestamp                     
              first      sum   mean                first                 last
    trip_id                                                                  
    a             a       29  63.75  2021-01-09 15:20:00  2021-06-19 08:22:18
    b             b       21  95.50  2021-02-19 08:12:18  2021-04-09 24:11:18
    b             b       15  56.00  2021-06-11 08:56:18  2021-06-11 08:56:18
    c             c        5  32.00  2021-06-03 18:12:18  2021-06-11 08:54:18
    c             c       33  64.00  2021-06-12 08:22:18  2021-06-19 16:10:18
    

    If we don't sort the columns, the result will look more alike to your expected output:

    #df_out = df.sort_values(["trip_id", "timestamp"])
    df_out = df
    g = df_out.groupby('trip_id')['distance'].cumsum() // 30
    df_out2 = (df_out.groupby(['trip_id', g])
                     .agg({'trip_id': 'first',
                           'distance':'sum',
                           'value': 'mean',
                           'timestamp':['first', 'last']
                          }).droplevel(-1)
              )                 
    

    Result:

    print(df_out2)
    
            trip_id distance  value            timestamp                     
              first      sum   mean                first                 last
    trip_id                                                                  
    a             a       29  63.75  2021-05-09 10:12:18  2021-01-09 18:30:18
    b             b       21  95.50  2021-04-09 24:11:18  2021-02-19 08:12:18
    b             b       15  56.00  2021-06-11 08:56:18  2021-06-11 08:56:18
    c             c       10  46.25  2021-06-11 08:54:18  2021-06-03 18:12:18
    c             c       28  39.00  2021-06-12 08:22:18  2021-06-12 08:22:18
    

    However, it seems reasonable to sort the trip_id and timestamp if you need to follow the time sequence. Hence, please review your real-life situation and choose from the options accordingly.