Search code examples
pythonpandasaggregation

Resample and aggregate data according to another column value


My time series is something like this:

TranID,Time,Price,Volume,SaleOrderVolume,BuyOrderVolume,Type,SaleOrderID,SaleOrderPrice,BuyOrderID,BuyOrderPrice
1,09:25:00,137.69,200,200,453,B,182023,137.69,241939,137.69
2,09:25:00,137.69,253,300,453,S,184857,137.69,241939,137.69
3,09:25:00,137.69,47,300,200,B,184857,137.69,241322,137.69
4,09:25:00,137.69,153,200,200,B,219208,137.69,241322,137.69

I want to resample and aggregate the dataframe by volume, but in result, I should be able to get something like:

Time, Volume_B, Volume_S
09:25:00, 400, 253

Volume_B is aggregated volume when the Type is 'B', and Volume_S is aggregated when its Type is 'S'.

My function is something like below, but it doesn't work well.

data.resample('t').agg(Volume_B=(Volume=lambda x: np.where(x['Type']=='B', x['Volume'], 0)), Volume_A=(Volume=lambda x: np.where(x['Type']=='S', x['Volume'], 0)))

How to properly implement this?


Solution

  • One way is to create the columns Volume_B (and _S) before with np.where like you did, then aggregate, so:

    res = (
        df.assign(Volume_B= lambda x: np.where(x['Type']=='B', x['Volume'], 0), 
                  Volume_S= lambda x: np.where(x['Type']=='S', x['Volume'], 0))\
          .groupby(df['Time']) # you can replace by resample here
          [['Volume_B','Volume_S']].sum()
          .reset_index()
    )
    print(res)
           Time  Volume_B  Volume_S
    0  09:25:00       400       253
    

    Edit, with your input like that (and aggregating on Time column), then you could also do a pivot_table like:

    (df.pivot_table(index='Time', columns='Type', 
                    values='Volume', aggfunc=sum)
       .add_prefix('Volume_')
       .reset_index()
       .rename_axis(columns=None)
    )