Tags: python, pandas, numpy, algorithmic-trading, back-testing

Vectorising a for loop where logic spans rows


The table below shows an example of a strategy where a signal is generated in row 2 and then an opposite signal is generated in row 5.

     row     open_signal     close_signal      live
      1           0               0             0
      2           1               0             1
      3           0               0             1
      4           0               0             1
      5           0               1             0
      6           0               0             0

I want to optimise the calculation of the live column (computed as in_trade in the loop below).

Is there a way to vectorise this problem in either Pandas or Numpy for improved performance, generating the same result as the for loop example below?

import pandas as pd
from datetime import datetime

example = {
    'date': [str(datetime(2017, 1, d)) for d in range(1, 15)],
    'open': [142.11, 142.87, 141.87, 142.11, 142.00, 142.41, 142.50, 142.75, 140.87, 141.25, 141.10, 141.15, 142.55, 142.75],
    'close': [142.87, 141.87, 142.11, 142.00, 142.41, 142.50, 142.75, 140.87, 141.25, 141.10, 141.15, 142.55, 142.75, 142.11],
    'open_signal': [False, False, False, False, False, True, False, False, False, False, False, False, False, False],
    'close_signal': [False, False, False, False, False, False, False, False, False, True, False, False, False, False],
}

data = pd.DataFrame(example)

# walk the rows, carrying the trade state forward
in_trade = False
for idx, row in data.iterrows():
    if row.open_signal:
        in_trade = True
    if row.close_signal:
        in_trade = False
    data.loc[idx, 'in_trade'] = in_trade

Solution

  • Simplistic case

    For the simplistic case as in the posted sample (a single open signal followed later by a single close signal), here's one vectorized way with NumPy: mark the open as +1 and the close as -1, then a cumulative sum is 1 while a trade is live and 0 otherwise -

    import numpy as np

    ar = np.zeros(len(data), dtype=int)
    ar[data.open_signal.values] = 1     # +1 where a trade opens
    ar[data.close_signal.values] = -1   # -1 where a trade closes
    data['out'] = ar.cumsum().astype(bool)
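
    As a quick sanity check, the vectorized column can be compared against the row-by-row loop from the question (a minimal sketch, assuming data already carries both the in_trade column from the loop and the out column from the snippet above) -

    # the vectorized result should agree with the loop on every row
    assert (data['out'] == data['in_trade']).all()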
    

    Runtime test -

    Using the sample dataset, scaled 100,000 times along the rows, for the testing.

    In [191]: data = pd.concat([data]*100000,axis=0); data.index = range(len(data))
    
    # @Dark's soln with int output
    In [192]: %timeit data['new'] = data['open_signal'].cumsum().ne(data['close_signal'].cumsum()).astype(int)
    100 loops, best of 3: 13.4 ms per loop
    
    # @Dark's soln with bool output
    In [194]: %timeit data['new'] = data['open_signal'].cumsum().ne(data['close_signal'].cumsum()).astype(bool)
    100 loops, best of 3: 10 ms per loop
    
    # Proposed in this post
    In [195]: %%timeit
         ...: ar = np.zeros(len(data), dtype=int)
         ...: ar[data.open_signal.values] = 1
         ...: ar[data.close_signal.values] = -1
         ...: data['out'] = ar.cumsum().astype(bool)
    100 loops, best of 3: 7.52 ms per loop
    

  • Generic case

    Now, to solve for a generic case with:

    1] A close signal coming with no prior open signal.

    2] Multiple open signals coming prior to the next close signal.

    3] Multiple close signals coming prior to the next open signal.

    We would need a few more steps. A small sketch that builds such a signal layout is shown just below, for reference.
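
    Here's a minimal sketch (column names as in the question, with the date/price columns omitted) that builds signals covering all three cases, matching the sample output further below: a close at row 2 with no prior open, repeated opens at rows 5 and 7, and repeated closes at rows 9 and 11 -

    import numpy as np
    import pandas as pd

    n = 14
    open_signal = np.zeros(n, dtype=bool)
    close_signal = np.zeros(n, dtype=bool)
    open_signal[[5, 7]] = True         # multiple open signals before the next close
    close_signal[[2, 9, 11]] = True    # a close with no prior open (row 2), then repeated closes
    data = pd.DataFrame({'open_signal': open_signal, 'close_signal': close_signal})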

    Approach #1: Here's one based upon searchsorted -

    # indices where open/close signals fire
    s0 = np.flatnonzero(data.open_signal.values)
    s1 = np.flatnonzero(data.close_signal.values)

    # keep an open signal only if a close has occurred since the previous open
    # (i.e. drop repeated opens before the next close)
    idx0 = np.searchsorted(s1, s0, 'right')
    s0c = s0[np.r_[True, idx0[1:] > idx0[:-1]]]

    # likewise, drop repeated closes before the next open
    idx1 = np.searchsorted(s0c, s1, 'right')
    s1c = s1[np.r_[True, idx1[1:] > idx1[:-1]]]

    ar = np.zeros(len(data), dtype=int)
    ar[s0c] = 1
    ar[s1c] = -1
    # a close signal with no prior open signal has no effect
    if s1c[0] < s0c[0]:
        ar[s1c[0]] = 0
    data['out'] = ar.cumsum().astype(bool)
    

    Sample output (using signals altered to include the generic cases listed above) -

    In [360]: data
    Out[360]: 
        close_signal  open_signal    out
    0          False        False  False
    1          False        False  False
    2           True        False  False
    3          False        False  False
    4          False        False  False
    5          False         True   True
    6          False        False   True
    7          False         True   True
    8          False        False   True
    9           True        False  False
    10         False        False  False
    11          True        False  False
    12         False        False  False
    13         False        False  False
    

    Approach #2: Possibly a faster one, as we avoid searchsorted and instead leverage masking -

    # +1 where a trade opens, -1 where it closes, 0 elsewhere
    mix_arr = data.open_signal.values.astype(int) - data.close_signal.values
    ar = np.zeros(len(data), dtype=int)
    mix_mask = mix_arr != 0
    mix_val = mix_arr[mix_mask]

    # keep only the first signal in each run of identical signals
    valid_mask = np.r_[True, mix_val[1:] != mix_val[:-1]]
    ar[mix_mask] = mix_arr[mix_mask]*valid_mask
    # a close signal with no prior open signal has no effect
    if mix_val[0] == -1:
        ar[mix_mask.argmax()] = 0

    data['out'] = ar.cumsum().astype(bool)
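
    As a usage sketch, the same masking logic can be wrapped into a reusable helper and applied to the signal columns (a minimal sketch; the name live_from_signals is purely illustrative and assumes the data DataFrame from the question is in scope) -

    import numpy as np

    def live_from_signals(open_signal, close_signal):
        """Return a boolean array that is True while a trade is live."""
        # +1 where a trade opens, -1 where it closes, 0 elsewhere
        mix_arr = np.asarray(open_signal).astype(int) - np.asarray(close_signal)
        ar = np.zeros(len(mix_arr), dtype=int)
        mix_mask = mix_arr != 0
        mix_val = mix_arr[mix_mask]
        # keep only the first signal in each run of identical signals
        valid_mask = np.r_[True, mix_val[1:] != mix_val[:-1]]
        ar[mix_mask] = mix_arr[mix_mask] * valid_mask
        # a close signal with no prior open signal has no effect
        if mix_val.size and mix_val[0] == -1:
            ar[mix_mask.argmax()] = 0
        return ar.cumsum().astype(bool)

    data['out'] = live_from_signals(data.open_signal.values, data.close_signal.values)

    Keeping the logic in a function makes it easy to reuse on other signal columns or to check against the row-by-row loop from the question.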