I'm trying to simulate financial trades using a vectorized approach in python. Part of this includes removing duplicate signals.
To elaborate, I've developed a buy_signal
column and a sell_signal
column. These columns contain booleans in the form of 1s and 0s.
Looking at the signals from the top-down, I don't want to trigger a second buy_signal
before a sell_signal
triggers, AKA if a 'position' is open. Same thing with sell signals, I do not want duplicate sell signals if a 'position' is closed. If a sell_signal
and buy_signal
are 1
, set them both to 0.
What is the best way to remove these irrelevant signals?
Here's an example:
import pandas as pd
df = pd.DataFrame(
{
"buy_signal": [1, 1, 1, 1, 0, 0, 1, 1, 1, 0],
"sell_signal": [0, 0, 1, 1, 1, 0, 0, 0, 1, 0],
}
)
print(df)
buy_signal sell_signal
0 1 0
1 1 0
2 1 1
3 1 1
4 0 1
5 0 0
6 1 0
7 1 0
8 1 1
9 0 0
Here's the result I want:
buy_signal sell_signal
0 1 0
1 0 0
2 0 1
3 0 0
4 0 0
5 0 0
6 1 0
7 0 0
8 0 1
9 0 0
As I said earlier (in a comment about a response since then deleted), one must consider the interaction between buy and sell signals, and cannot simply operate on each independently.
The key idea is to consider a quantity q
(or "position") that is the amount currently held, and that the OP says would like bounded to [0, 1]
. That quantity is cumsum(buy - sell)
after cleaning.
Therefore, the problem reduces to "cumulative sum with limits", which unfortunately cannot be done in a vectorized way with numpy
or pandas
, but that we can code quite efficiently using numba
. The code below processes 1 million rows in 37 ms.
import numpy as np
from numba import njit
@njit
def cumsum_clip(a, xmin=-np.inf, xmax=np.inf):
res = np.empty_like(a)
c = 0
for i in range(len(a)):
c = min(max(c + a[i], xmin), xmax)
res[i] = c
return res
def clean_buy_sell(df, xmin=0, xmax=1):
# model the quantity held: cumulative sum of buy-sell clipped in
# [xmin, xmax]
# note that, when buy and sell are equal, there is no change
q = cumsum_clip(
(df['buy_signal'] - df['sell_signal']).values,
xmin=xmin, xmax=xmax)
# derive actual transactions: positive for buy, negative for sell, 0 for hold
trans = np.diff(np.r_[0, q])
df = df.assign(
buy_signal=np.clip(trans, 0, None),
sell_signal=np.clip(-trans, 0, None),
)
return df
Now:
df = pd.DataFrame(
{
"buy_signal": [1, 1, 1, 1, 0, 0, 1, 1, 1, 0],
"sell_signal": [0, 0, 1, 1, 1, 0, 0, 0, 1, 0],
}
)
new_df = clean_buy_sell(df)
>>> new_df
buy_signal sell_signal
0 1 0
1 0 0
2 0 0
3 0 0
4 0 1
5 0 0
6 1 0
7 0 0
8 0 0
9 0 0
n = 1_000_000
np.random.seed(0) # repeatable example
df = pd.DataFrame(np.random.choice([0, 1], (n, 2)),
columns=['buy_signal', 'sell_signal'])
%timeit clean_buy_sell(df)
37.3 ms ± 104 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Correctness tests:
z = clean_buy_sell(df)
q = (z['buy_signal'] - z['sell_signal']).cumsum()
# q is quantity held through time; must be in {0, 1}
assert q.isin({0, 1}).all()
# we should not have introduced any new buy signal:
# check that any buy == 1 in z was also 1 in df
assert not (z['buy_signal'] & ~df['buy_signal']).any()
# same for sell signal:
assert not (z['sell_signal'] & ~df['sell_signal']).any()
# finally, buy and sell should never be 1 on the same row:
assert not (z['buy_signal'] & z['sell_signal']).any()
For fun, we can consider the more general case where buy and sell values are fractional (or any float value), and the limits are not [0, 1]
. There is nothing to change to the current version of clean_buy_sell
, which is general enough to handle these conditions.
np.random.seed(0)
df = pd.DataFrame(
np.random.uniform(0, 1, (100, 2)),
columns=['buy_signal', 'sell_signal'],
)
# set limits to -1, 2: we can sell short (borrow) up to 1 unit
# and own up to 2 units.
z = clean_buy_sell(df, -1, 2)
(z['buy_signal'] - z['sell_signal']).cumsum().plot()