Python Pandas: Vectorized Way of Cleaning Buy and Sell Signals

I'm trying to simulate financial trades using a vectorized approach in python. Part of this includes removing duplicate signals.

To elaborate, I've developed a buy_signal column and a sell_signal column. These columns contain booleans in the form of 1s and 0s.

Looking at the signals from the top-down, I don't want to trigger a second buy_signal before a sell_signal triggers, AKA if a 'position' is open. Same thing with sell signals, I do not want duplicate sell signals if a 'position' is closed. If a sell_signal and buy_signal are 1, set them both to 0.

What is the best way to remove these irrelevant signals?

Here's an example:

import pandas as pd

df = pd.DataFrame(
    {
        "buy_signal": [1, 1, 1, 1, 0, 0, 1, 1, 1, 0],
        "sell_signal": [0, 0, 1, 1, 1, 0, 0, 0, 1, 0],
    }
)
print(df)

   buy_signal  sell_signal
0           1            0
1           1            0
2           1            1
3           1            1
4           0            1
5           0            0
6           1            0
7           1            0
8           1            1
9           0            0

Here's the result I want:

   buy_signal  sell_signal
0           1            0
1           0            0
2           0            1
3           0            0
4           0            0
5           0            0
6           1            0
7           0            0
8           0            1
9           0            0

Solution

As I said earlier (in a comment about a response since then deleted), one must consider the interaction between buy and sell signals, and cannot simply operate on each independently.

The key idea is to consider a quantity q (or "position") that is the amount currently held, and that the OP says would like bounded to [0, 1]. That quantity is cumsum(buy - sell) after cleaning.

Therefore, the problem reduces to "cumulative sum with limits", which unfortunately cannot be done in a vectorized way with numpy or pandas, but that we can code quite efficiently using numba. The code below processes 1 million rows in 37 ms.

import numpy as np
from numba import njit

@njit
def cumsum_clip(a, xmin=-np.inf, xmax=np.inf):
    res = np.empty_like(a)
    c = 0
    for i in range(len(a)):
        c = min(max(c + a[i], xmin), xmax)
        res[i] = c
    return res


def clean_buy_sell(df, xmin=0, xmax=1):
    # model the quantity held: cumulative sum of buy-sell clipped in
    # [xmin, xmax]
    # note that, when buy and sell are equal, there is no change
    q = cumsum_clip(
        (df['buy_signal'] - df['sell_signal']).values,
        xmin=xmin, xmax=xmax)

    # derive actual transactions: positive for buy, negative for sell, 0 for hold
    trans = np.diff(np.r_[0, q])
    df = df.assign(
        buy_signal=np.clip(trans, 0, None),
        sell_signal=np.clip(-trans, 0, None),
    )
    
    return df

Now:

df = pd.DataFrame(
    {
        "buy_signal": [1, 1, 1, 1, 0, 0, 1, 1, 1, 0],
        "sell_signal": [0, 0, 1, 1, 1, 0, 0, 0, 1, 0],
    }
)
new_df = clean_buy_sell(df)

>>> new_df
   buy_signal  sell_signal
0           1            0
1           0            0
2           0            0
3           0            0
4           0            1
5           0            0
6           1            0
7           0            0
8           0            0
9           0            0

Speed and correctness

n = 1_000_000

np.random.seed(0)  # repeatable example
df = pd.DataFrame(np.random.choice([0, 1], (n, 2)),
                  columns=['buy_signal', 'sell_signal'])

%timeit clean_buy_sell(df)
37.3 ms ± 104 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Correctness tests:

z = clean_buy_sell(df)
q = (z['buy_signal'] - z['sell_signal']).cumsum()

# q is quantity held through time; must be in {0, 1}
assert q.isin({0, 1}).all()

# we should not have introduced any new buy signal:
# check that any buy == 1 in z was also 1 in df

assert not (z['buy_signal'] & ~df['buy_signal']).any()

# same for sell signal:
assert not (z['sell_signal'] & ~df['sell_signal']).any()

# finally, buy and sell should never be 1 on the same row:
assert not (z['buy_signal'] & z['sell_signal']).any()

Bonus: other limits, fractional buys and sells

For fun, we can consider the more general case where buy and sell values are fractional (or any float value), and the limits are not [0, 1]. There is nothing to change to the current version of clean_buy_sell, which is general enough to handle these conditions.

np.random.seed(0)
df = pd.DataFrame(
    np.random.uniform(0, 1, (100, 2)),
    columns=['buy_signal', 'sell_signal'],
)

# set limits to -1, 2: we can sell short (borrow) up to 1 unit
# and own up to 2 units.
z = clean_buy_sell(df, -1, 2)

(z['buy_signal'] - z['sell_signal']).cumsum().plot()