I have a simple, yet challenging algorithmic problem to solve.
I have a dataset at the trader - stock - day level, and I want to identify the round-trips in the data. Round-trips are just specific sequences in the data. That is, if you cumulate over time the holding position of stock s for individual i, a round-trip starts and ends with a zero net holding position.
An example:
pd.DataFrame({'trader' : ['a','a','a','a','a','a','a'],
'stock' : ['a','a','a','a','a','a','b'],
'day' :[0,1,2,4,5,10,1],
'delta':[10,-10,15,-10,-5,5,6],
'cumq' : [10,0,15,5,0,5,6] ,
'tag': [1,1,2,2,2,0,0]})
output
Out[15]:
cumq day delta stock tag trader
0 10 0 10 a 1 a
1 0 1 -10 a 1 a
2 15 2 15 a 2 a
3 5 4 -10 a 2 a
4 0 5 -5 a 2 a
5 5 10 5 a 0 a
6 6 1 6 b 0 a
Here, you can see that trader a
has bought two stocks (a
and b
). delta
corresponds to the quantity purchased or sold on a given day. So you can see that we have two full round-trips on stock a
(-10 +10 = 0 ) and (+15 - 10 -5 =0) and two non-completed round-trips (+5 on stock a
, and +6 on stock b
).
I would like to be able to create a variable tag
that tells me this. That is, a variable that is 0 whenever a round-trip is not terminated, and that takes the values 1, 2, 3 for the first, second, third, etc round-trip identified in a given stock-trader combination.
Any idea how to do that in an efficient way? Many thanks!
EDIT:
PROPOSED SOLUTION - SEEMS TO WORK - VERY SLOW
import pandas as pd
import numpy as np
df = pd.DataFrame({'trader' : ['a','a','a','a','a','a','a'],'stock' : ['a','a','a','a','a','a','b'], 'day' :[0,1,2,4,5,10,1],'delta':[10,-10,15,-10,-5,5,0] ,'out': [1,1,2,2,2,0,1]})
def proc_trader(_df):
_df['tag'] = np.nan
# make every ending of a roundtrip with its index
_df['tag'][df.cumq == 0] = range(1, (_df.cumq == 0).sum() + 1)
# backfill the roundtrip index until previous roundtrip;
# then fill the rest with 0s (roundtrip incomplete for most recent trades)
_df['tag'] = _df['tag'].fillna(method = 'bfill').fillna(0)
return _df
df.sort_values(by=['trader', 'stock','day'], inplace=True)
df['cumq']=df.groupby(['trader', 'stock']).delta.transform('cumsum')
df = df.groupby(['trader', 'stock']).apply(proc_trader)
Any ideas? Many thanks!!!!
I would do something like this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'trader' : ['a','a','a','a','a','a','a'],'stock' : ['a','a','a','a','a','a','b'], 'day' :[0,1,2,4,5,10,1],'delta':[10,-10,15,-10,-5,5,6], 'cumq' : [10,0,15,5,0,6,11] ,'tag': [1,1,2,2,2,0,0]})
def proc_trader(_df):
if _df.shape[0] == 1:
_df['tag'] = _df['delta'] == 0
return _df
_df['tag'] = np.nan
# make every ending of a roundtrip with its index
_df['tag'][df.cumq == 0] = range(1, (df.cumq == 0).sum() + 1)
# backfill the roundtrip index until previous roundtrip;
# then fill the rest with 0s (roundtrip incomplete for most recent trades)
_df['tag'] = _df['tag'].fillna(method = 'bfill').fillna(0)
return _df
df = df.groupby(['trader', 'stock']).apply(proc_trader)
The idea is that each trader's record is comprised of a series of roundtrips except possible in the end; this assumes something about how zero-delta periods should be counted. We detect all ends of roundtrip periods, backprop the index and fill the rest with 0s.