Python: aggregate a time series using a function that depends on the value in another column


My time series is something like this:

TranID,Time,Price,Volume,SaleOrderVolume,BuyOrderVolume,Type,SaleOrderID,SaleOrderPrice,BuyOrderID,BuyOrderPrice
1,09:25:00,137.69,200,200,453,B,182023,137.69,241939,137.69
2,09:25:00,137.69,253,300,453,S,184857,137.69,241939,137.69
3,09:25:00,137.69,47,300,200,B,184857,137.69,241322,137.69
4,09:25:00,137.69,153,200,200,B,219208,137.69,241322,137.69

I can do the aggregation by summing the Volume column:

res = df.resample('t').agg({'Volume': 'sum'})

But I would like to aggregate the Volume and Type columns together: when Type is B, add the volume; when Type is S, subtract it. If the total volume after aggregation is negative, the resulting Type is S; otherwise it is B.

In the above example, after I aggregate the volume, the total volume would become

200 - 253 + 300 + 200 = 447

and the type is B since 447 > 0

result:

Time,Volume,Type
09:25:00,447,B

Solution

  • The simplest approach is to multiply Volume by 1 or -1 depending on the value in Type, using map, and then assign the Type column based on the sign of the summed volume.

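    import numpy as np

    # Assumes df is the DataFrame from the question.
    res = (
        # sign each trade: +Volume for B, -Volume for S
        (df['Volume'] * df['Type'].map({'S': -1, 'B': 1}))
          # resample should also work here, but it needs a datetime-like
          # index, which the sample input does not have
          .groupby(df['Time']).sum()
          .reset_index(name='Volume')
          # a negative total is S; anything else is B
          .assign(Type=lambda x: np.where(x['Volume'] >= 0, 'B', 'S'))
    )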
    
    print(res)
           Time  Volume Type
    0  09:25:00     147    B
    # The total is 147, not 447: 200 - 253 + 47 + 153 = 147. The 447 in the
    # question seems to take 300 and 200 from SaleOrderVolume for rows 3 and 4.
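
The resample route mentioned in the code comment above could be sketched as follows. This is only a minimal sketch, assuming Time holds plain HH:MM:SS strings from a single day; parsing them without a date places every row on 1900-01-01, which is used here purely for bucketing. The names csv_text, timestamps and signed are illustrative and do not come from the original post.

```python
import io

import numpy as np
import pandas as pd

# Sample rows copied from the question
csv_text = """TranID,Time,Price,Volume,SaleOrderVolume,BuyOrderVolume,Type,SaleOrderID,SaleOrderPrice,BuyOrderID,BuyOrderPrice
1,09:25:00,137.69,200,200,453,B,182023,137.69,241939,137.69
2,09:25:00,137.69,253,300,453,S,184857,137.69,241939,137.69
3,09:25:00,137.69,47,300,200,B,184857,137.69,241322,137.69
4,09:25:00,137.69,153,200,200,B,219208,137.69,241322,137.69
"""

df = pd.read_csv(io.StringIO(csv_text))

# Assumption: no date is available, so every row lands on 1900-01-01
timestamps = pd.to_datetime(df["Time"], format="%H:%M:%S")

# Signed volume: +Volume for buys (B), -Volume for sells (S)
signed = df["Volume"] * df["Type"].map({"S": -1, "B": 1})
signed.index = pd.DatetimeIndex(timestamps)

# One-minute buckets; "min" is the minute alias
res = signed.resample("1min").sum().to_frame("Volume")
res["Type"] = np.where(res["Volume"] >= 0, "B", "S")

# Show only the time of day again, since the date is a placeholder
res.index = res.index.strftime("%H:%M:%S")
res.index.name = "Time"

print(res)
```

Unlike groupby on Time, resample also emits a row for every empty minute between the first and last trade, with a Volume of 0, so those rows may need to be filtered out afterwards.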