Tags: python, pandas, numpy, algorithmic-trading, back-testing

Vectorising a for loop where logic spans rows


The table below shows an example of a strategy where a signal is generated in row 2 and then an opposite signal is generated in row 5.

     row     open_signal     close_signal      live
      1           0               0             0
      2           1               0             1
      3           0               0             1
      4           0               0             1
      5           0               1             0
      6           0               0             0

I want to optimise the calculation of the live column (computed as in_trade in the loop below).

Is there a way to vectorise this problem in either Pandas or Numpy for improved performance, generating the same result as the for loop example below?

import pandas as pd
from datetime import datetime

example = {
    'date': [str(datetime(2017, 1, d)) for d in range(1, 15)],
    'open': [142.11, 142.87, 141.87, 142.11, 142.00, 142.41, 142.50, 142.75, 140.87, 141.25, 141.10, 141.15, 142.55, 142.75],
    'close': [142.87, 141.87, 142.11, 142.00, 142.41, 142.50, 142.75, 140.87, 141.25, 141.10, 141.15, 142.55, 142.75, 142.11],
    'open_signal': [False, False, False, False, False, True, False, False, False, False, False, False, False, False],
    'close_signal': [False, False, False, False, False, False, False, False, False, True, False, False, False, False],
}

data = pd.DataFrame(example)

# walk the rows, carrying the trade state forward
in_trade = False
for idx, row in data.iterrows():
    if row.open_signal:
        in_trade = True
    if row.close_signal:
        in_trade = False
    data.loc[idx, 'in_trade'] = in_trade

Solution

  • Simplistic case

    For the simplistic case as in the posted sample (a single open signal followed later by a single close signal), here's one vectorized way with NumPy: mark the open as +1 and the close as -1, then a cumulative sum is 1 while a trade is live and 0 otherwise -

    import numpy as np

    ar = np.zeros(len(data), dtype=int)
    ar[data.open_signal.values] = 1     # +1 where a trade opens
    ar[data.close_signal.values] = -1   # -1 where a trade closes
    data['out'] = ar.cumsum().astype(bool)
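
    As a quick sanity check, the vectorized column can be compared against the row-by-row loop from the question (a minimal sketch, assuming data already carries both the in_trade column from the loop and the out column from the snippet above) -

    # the vectorized result should agree with the loop on every row
    assert (data['out'] == data['in_trade']).all()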
    

    Runtime test -

    Using the sample dataset, scaled 100,000 times along the rows, for the testing.

    In [191]: data = pd.concat([data]*100000,axis=0); data.index = range(len(data))
    
    # @Dark's soln with int output
    In [192]: %timeit data['new'] = data['open_signal'].cumsum().ne(data['close_signal'].cumsum()).astype(int)
    100 loops, best of 3: 13.4 ms per loop
    
    # @Dark's soln with bool output
    In [194]: %timeit data['new'] = data['open_signal'].cumsum().ne(data['close_signal'].cumsum()).astype(bool)
    100 loops, best of 3: 10 ms per loop
    
    # Proposed in this post
    In [195]: %%timeit
         ...: ar = np.zeros(len(data), dtype=int)
         ...: ar[data.open_signal.values] = 1
         ...: ar[data.close_signal.values] = -1
         ...: data['out'] = ar.cumsum().astype(bool)
    100 loops, best of 3: 7.52 ms per loop
    

  • Generic case

    Now, to solve for a generic case with:

    1] A close signal coming with no prior open signal.

    2] Multiple open signals coming prior to the next close signal.

    3] Multiple close signals coming prior to the next open signal.

    We would need a few more steps. A small sketch that builds such a signal layout is shown just below, for reference.
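
    Here's a minimal sketch (column names as in the question, with the date/price columns omitted) that builds signals covering all three cases, matching the sample output further below: a close at row 2 with no prior open, repeated opens at rows 5 and 7, and repeated closes at rows 9 and 11 -

    import numpy as np
    import pandas as pd

    n = 14
    open_signal = np.zeros(n, dtype=bool)
    close_signal = np.zeros(n, dtype=bool)
    open_signal[[5, 7]] = True         # multiple open signals before the next close
    close_signal[[2, 9, 11]] = True    # a close with no prior open (row 2), then repeated closes
    data = pd.DataFrame({'open_signal': open_signal, 'close_signal': close_signal})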

    Approach #1: Here's one based upon searchsorted -

    # indices where open/close signals fire
    s0 = np.flatnonzero(data.open_signal.values)
    s1 = np.flatnonzero(data.close_signal.values)

    # keep an open signal only if a close has occurred since the previous open
    # (i.e. drop repeated opens before the next close)
    idx0 = np.searchsorted(s1, s0, 'right')
    s0c = s0[np.r_[True, idx0[1:] > idx0[:-1]]]

    # likewise, drop repeated closes before the next open
    idx1 = np.searchsorted(s0c, s1, 'right')
    s1c = s1[np.r_[True, idx1[1:] > idx1[:-1]]]

    ar = np.zeros(len(data), dtype=int)
    ar[s0c] = 1
    ar[s1c] = -1
    # a close signal with no prior open signal has no effect
    if s1c[0] < s0c[0]:
        ar[s1c[0]] = 0
    data['out'] = ar.cumsum().astype(bool)
    

    Sample output (using signals altered to include the generic cases listed above) -

    In [360]: data
    Out[360]: 
        close_signal  open_signal    out
    0          False        False  False
    1          False        False  False
    2           True        False  False
    3          False        False  False
    4          False        False  False
    5          False         True   True
    6          False        False   True
    7          False         True   True
    8          False        False   True
    9           True        False  False
    10         False        False  False
    11          True        False  False
    12         False        False  False
    13         False        False  False
    

    Approach #2: Possibly a faster one, as we avoid searchsorted and instead leverage masking -

    # +1 where a trade opens, -1 where it closes, 0 elsewhere
    mix_arr = data.open_signal.values.astype(int) - data.close_signal.values
    ar = np.zeros(len(data), dtype=int)
    mix_mask = mix_arr != 0
    mix_val = mix_arr[mix_mask]

    # keep only the first signal in each run of identical signals
    valid_mask = np.r_[True, mix_val[1:] != mix_val[:-1]]
    ar[mix_mask] = mix_arr[mix_mask]*valid_mask
    # a close signal with no prior open signal has no effect
    if mix_val[0] == -1:
        ar[mix_mask.argmax()] = 0

    data['out'] = ar.cumsum().astype(bool)
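
    As a usage sketch, the same masking logic can be wrapped into a reusable helper and applied to the signal columns (a minimal sketch; the name live_from_signals is purely illustrative and assumes the data DataFrame from the question is in scope) -

    import numpy as np

    def live_from_signals(open_signal, close_signal):
        """Return a boolean array that is True while a trade is live."""
        # +1 where a trade opens, -1 where it closes, 0 elsewhere
        mix_arr = np.asarray(open_signal).astype(int) - np.asarray(close_signal)
        ar = np.zeros(len(mix_arr), dtype=int)
        mix_mask = mix_arr != 0
        mix_val = mix_arr[mix_mask]
        # keep only the first signal in each run of identical signals
        valid_mask = np.r_[True, mix_val[1:] != mix_val[:-1]]
        ar[mix_mask] = mix_arr[mix_mask] * valid_mask
        # a close signal with no prior open signal has no effect
        if mix_val.size and mix_val[0] == -1:
            ar[mix_mask.argmax()] = 0
        return ar.cumsum().astype(bool)

    data['out'] = live_from_signals(data.open_signal.values, data.close_signal.values)

    Keeping the logic in a function makes it easy to reuse on other signal columns or to check against the row-by-row loop from the question.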