Search code examples
pythonpandasdata-cleaning

Python Pandas: Vectorized Way of Cleaning Buy and Sell Signals


I'm trying to simulate financial trades using a vectorized approach in python. Part of this includes removing duplicate signals.

To elaborate, I've developed a buy_signal column and a sell_signal column. These columns contain booleans in the form of 1s and 0s.

Looking at the signals from the top-down, I don't want to trigger a second buy_signal before a sell_signal triggers, AKA if a 'position' is open. Same thing with sell signals, I do not want duplicate sell signals if a 'position' is closed. If a sell_signal and buy_signal are 1, set them both to 0.

What is the best way to remove these irrelevant signals?

Here's an example:

import pandas as pd

df = pd.DataFrame(
    {
        "buy_signal": [1, 1, 1, 1, 0, 0, 1, 1, 1, 0],
        "sell_signal": [0, 0, 1, 1, 1, 0, 0, 0, 1, 0],
    }
)
print(df)
   buy_signal  sell_signal
0           1            0
1           1            0
2           1            1
3           1            1
4           0            1
5           0            0
6           1            0
7           1            0
8           1            1
9           0            0

Here's the result I want:

   buy_signal  sell_signal
0           1            0
1           0            0
2           0            1
3           0            0
4           0            0
5           0            0
6           1            0
7           0            0
8           0            1
9           0            0

Solution

  • As I said earlier (in a comment about a response since then deleted), one must consider the interaction between buy and sell signals, and cannot simply operate on each independently.

    The key idea is to consider a quantity q (or "position") that is the amount currently held, and that the OP says would like bounded to [0, 1]. That quantity is cumsum(buy - sell) after cleaning.

    Therefore, the problem reduces to "cumulative sum with limits", which unfortunately cannot be done in a vectorized way with numpy or pandas, but that we can code quite efficiently using numba. The code below processes 1 million rows in 37 ms.

    import numpy as np
    from numba import njit
    
    @njit
    def cumsum_clip(a, xmin=-np.inf, xmax=np.inf):
        res = np.empty_like(a)
        c = 0
        for i in range(len(a)):
            c = min(max(c + a[i], xmin), xmax)
            res[i] = c
        return res
    
    
    def clean_buy_sell(df, xmin=0, xmax=1):
        # model the quantity held: cumulative sum of buy-sell clipped in
        # [xmin, xmax]
        # note that, when buy and sell are equal, there is no change
        q = cumsum_clip(
            (df['buy_signal'] - df['sell_signal']).values,
            xmin=xmin, xmax=xmax)
    
        # derive actual transactions: positive for buy, negative for sell, 0 for hold
        trans = np.diff(np.r_[0, q])
        df = df.assign(
            buy_signal=np.clip(trans, 0, None),
            sell_signal=np.clip(-trans, 0, None),
        )
        
        return df
    

    Now:

    df = pd.DataFrame(
        {
            "buy_signal": [1, 1, 1, 1, 0, 0, 1, 1, 1, 0],
            "sell_signal": [0, 0, 1, 1, 1, 0, 0, 0, 1, 0],
        }
    )
    new_df = clean_buy_sell(df)
    
    >>> new_df
       buy_signal  sell_signal
    0           1            0
    1           0            0
    2           0            0
    3           0            0
    4           0            1
    5           0            0
    6           1            0
    7           0            0
    8           0            0
    9           0            0
    

    Speed and correctness

    n = 1_000_000
    
    np.random.seed(0)  # repeatable example
    df = pd.DataFrame(np.random.choice([0, 1], (n, 2)),
                      columns=['buy_signal', 'sell_signal'])
    
    %timeit clean_buy_sell(df)
    37.3 ms ± 104 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    

    Correctness tests:

    z = clean_buy_sell(df)
    q = (z['buy_signal'] - z['sell_signal']).cumsum()
    
    # q is quantity held through time; must be in {0, 1}
    assert q.isin({0, 1}).all()
    
    # we should not have introduced any new buy signal:
    # check that any buy == 1 in z was also 1 in df
    
    assert not (z['buy_signal'] & ~df['buy_signal']).any()
    
    # same for sell signal:
    assert not (z['sell_signal'] & ~df['sell_signal']).any()
    
    # finally, buy and sell should never be 1 on the same row:
    assert not (z['buy_signal'] & z['sell_signal']).any()
    

    Bonus: other limits, fractional buys and sells

    For fun, we can consider the more general case where buy and sell values are fractional (or any float value), and the limits are not [0, 1]. There is nothing to change to the current version of clean_buy_sell, which is general enough to handle these conditions.

    np.random.seed(0)
    df = pd.DataFrame(
        np.random.uniform(0, 1, (100, 2)),
        columns=['buy_signal', 'sell_signal'],
    )
    
    # set limits to -1, 2: we can sell short (borrow) up to 1 unit
    # and own up to 2 units.
    z = clean_buy_sell(df, -1, 2)
    
    (z['buy_signal'] - z['sell_signal']).cumsum().plot()