Search code examples
pythonalgorithmpandasfinance

how to identify specific sequences (round-trips) in a pandas dataset?


I have a simple, yet challenging algorithmic problem to solve.

I have a dataset at the trader - stock - day level, and I want to identify the round-trips in the data. Round-trips are just specific sequences in the data. That is, if you cumulate over time the holding position of stock s for individual i, a round-trip starts and ends with a zero net holding position.

An example:

pd.DataFrame({'trader' : ['a','a','a','a','a','a','a'],
              'stock' : ['a','a','a','a','a','a','b'],
              'day' :[0,1,2,4,5,10,1],
              'delta':[10,-10,15,-10,-5,5,6],
              'cumq' : [10,0,15,5,0,5,6] ,
              'tag': [1,1,2,2,2,0,0]})

output

 Out[15]: 
   cumq  day  delta stock  tag trader
0    10    0     10     a    1      a
1     0    1    -10     a    1      a
2    15    2     15     a    2      a
3     5    4    -10     a    2      a
4     0    5     -5     a    2      a
5     5   10      5     a    0      a
6     6    1      6     b    0      a

Here, you can see that trader a has bought two stocks (a and b). delta corresponds to the quantity purchased or sold on a given day. So you can see that we have two full round-trips on stock a (-10 +10 = 0 ) and (+15 - 10 -5 =0) and two non-completed round-trips (+5 on stock a, and +6 on stock b).

I would like to be able to create a variable tag that tells me this. That is, a variable that is 0 whenever a round-trip is not terminated, and that takes the values 1, 2, 3 for the first, second, third, etc round-trip identified in a given stock-trader combination.

Any idea how to do that in an efficient way? Many thanks!

EDIT:

  • short sales are possible. For instance sell 10, buy 25, sell 25 and buy 10 would be one round trip.
  • a round-trip can be implemented during the same day: in that case the delta will be zero (indeed, you buy and sell the same quantity that day). This means that several daily round-trip can follow each other in the data. So the 0 boundaries for the round-trips are inclusive. Example: you buy 10 and sell 10 today (delta = 0), then you buy 15 and sell 15 tomorrow (delta = 0 again). These correspond to two different consecutive round-trips

PROPOSED SOLUTION - SEEMS TO WORK - VERY SLOW

import pandas as pd
import numpy as np

df = pd.DataFrame({'trader' : ['a','a','a','a','a','a','a'],'stock' : ['a','a','a','a','a','a','b'], 'day' :[0,1,2,4,5,10,1],'delta':[10,-10,15,-10,-5,5,0] ,'out': [1,1,2,2,2,0,1]})

def proc_trader(_df):
    _df['tag'] = np.nan
    # make every ending of a roundtrip with its index
    _df['tag'][df.cumq == 0] = range(1, (_df.cumq == 0).sum() + 1)
    # backfill the roundtrip index until previous roundtrip;
    # then fill the rest with 0s (roundtrip incomplete for most recent trades)
    _df['tag'] = _df['tag'].fillna(method = 'bfill').fillna(0)
    return _df

df.sort_values(by=['trader', 'stock','day'], inplace=True)
df['cumq']=df.groupby(['trader', 'stock']).delta.transform('cumsum')
df = df.groupby(['trader', 'stock']).apply(proc_trader)

Any ideas? Many thanks!!!!


Solution

  • I would do something like this:

    import pandas as pd
    import numpy as np
    
    df = pd.DataFrame({'trader' : ['a','a','a','a','a','a','a'],'stock' : ['a','a','a','a','a','a','b'], 'day' :[0,1,2,4,5,10,1],'delta':[10,-10,15,-10,-5,5,6], 'cumq' : [10,0,15,5,0,6,11] ,'tag': [1,1,2,2,2,0,0]})
    
    def proc_trader(_df):
        if _df.shape[0] == 1:
            _df['tag'] = _df['delta'] == 0
            return _df
    
        _df['tag'] = np.nan
        # make every ending of a roundtrip with its index
        _df['tag'][df.cumq == 0] = range(1, (df.cumq == 0).sum() + 1)
        # backfill the roundtrip index until previous roundtrip;
        # then fill the rest with 0s (roundtrip incomplete for most recent trades)
        _df['tag'] = _df['tag'].fillna(method = 'bfill').fillna(0)
        return _df
    
    df = df.groupby(['trader', 'stock']).apply(proc_trader)
    

    The idea is that each trader's record is comprised of a series of roundtrips except possible in the end; this assumes something about how zero-delta periods should be counted. We detect all ends of roundtrip periods, backprop the index and fill the rest with 0s.