python, pandas, quantitative-finance

Pandas GroupBy Consecutive with Where Condition


I'm trying to "combine" consecutive rows of like data where certain conditions match, and everything I've tried is throwing errors or putting the data together in an unexpected way.

Data:

       open    high     low   close     volume       datetime
0    257.31  259.04  255.63  257.86  335889185  1510552800000
1    258.14  260.48  257.86  260.36  190142219  1511157600000
2    260.41  266.05  260.00  264.46  521044032  1511762400000
3    266.31  266.80  262.71  265.51  401716112  1512367200000
4    265.58  267.56  265.39  266.51  516455674  1512972000000
..      ...     ...     ...     ...        ...            ...
151  336.06  347.35  334.38  346.85  297612670  1601874000000
152  349.59  354.02  343.13  347.29  361462322  1602478800000
153  348.65  349.33  340.65  345.78  296595696  1603083600000
154  342.13  342.98  322.60  326.54  495607791  1603688400000
155  330.20  352.19  327.24  350.16  463334913  1604296800000

I want to combine consecutive rows of the same type, where open > close (red) or close > open (green), so that I can have one large candle (this is stock data) for each run of identical candles.

Initially I started by making a column to signify which type of row it is (probably not required, and the comparison could likely be done in a one-liner during the row merges?):

def green_or_red(row):
    if row['open'] > row['close']:
        val = 'R'
    elif row['open'] < row['close']:
        val = 'G'
    else:
        val = 'N'
    return val

df['candle_is'] = df.apply(green_or_red, axis=1)

This assigns the column correctly; however, the merging of consecutive rows is where I have an issue:

# merge the consecutive same types of candles
g = df['candle_is'].ne(df['candle_is'].shift()).cumsum()
dfn = df.groupby(['candle_is', g], sort=False).agg({'open': max, 'close': min, 'high': max, 'low': min, 'volume': sum})

produces:

                       open   close      high     low      volume
candle_is candle_is
G         1          260.41  257.86  266.0500  255.63  1047075436
R         2          266.31  265.51  266.8000  262.71   401716112
G         3          265.58  266.51  267.5600  265.39   516455674
R         4          268.10  266.86  268.6000  266.64   632660142
G         5          280.17  273.42  286.6285  267.40  1655227273
...                     ...     ...       ...     ...         ...
          73         342.12  326.52  350.7200  319.64  1280999271
R         74         350.35  330.65  358.7500  327.97  1257122392
G         75         336.06  328.73  347.3500  319.80  1099865805
R         76         349.59  326.54  354.0200  322.60  1153665809
G         77         330.20  350.16  352.1900  327.24   463334913

But I need to separate the logic for red (R) and green (G) candles so that the agg() works a bit differently, because the open/close values should be swapped between min and max for each type:

# green
df.groupby(['candle_is', g], sort=False).agg({'open': max, 'close': min, 'high': max, 'low': min, 'volume': sum})
# red
df.groupby(['candle_is', g], sort=False).agg({'open': min, 'close': max, 'high': max, 'low': min, 'volume': sum})

However, I can't find a way to use g or df['candle_is'] == 'G' to target those specifically without getting a slew of errors, because once I filter the data, the sizes no longer match. How can this be accomplished sanely? Thanks!
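
A rough sketch of one workaround for that size mismatch, assuming the df['candle_is'] column and the grouper g from the snippets above: aggregate the full frame twice (once with each dict) and then keep only the rows of each result whose candle type matches.

import pandas as pd

# aggregate the whole frame both ways so the grouper g stays aligned with df
green_agg = df.groupby(['candle_is', g], sort=False).agg(
    {'open': 'max', 'close': 'min', 'high': 'max', 'low': 'min', 'volume': 'sum'})
red_agg = df.groupby(['candle_is', g], sort=False).agg(
    {'open': 'min', 'close': 'max', 'high': 'max', 'low': 'min', 'volume': 'sum'})

# keep each candle type's rows from the matching result
combined = pd.concat([
    green_agg[green_agg.index.get_level_values(0) == 'G'],
    red_agg[red_agg.index.get_level_values(0) == 'R'],
]).sort_index(level=1)

This computes every group twice, which is fine for small frames; the solution below avoids the double pass with a sign trick.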


Solution

  • If you are seeking to swap your min/max, it might be easier to notice that max(-array) = -min(array). So we can just multiply the open/close values by -1, aggregate, and multiply back afterwards:

    import numpy as np
    import pandas as pd

    # use this instead of `apply`, which is not vectorized
    candles = np.select([df['open'] > df['close'], df['open'] < df['close']],
                        ['R', 'G'], 'N')

    # turn candles into a Series aligned with df's index
    candles = pd.Series(candles, index=df.index)

    # label each run of consecutive identical candle types
    g = candles.ne(candles.shift()).cumsum()

    # change the sign of `red` candles so min becomes max and vice versa
    multipliers = np.where(candles == 'R', -1, 1)

    # flip only open/close -- high, low and volume aggregate the same way
    # for both candle types, so they must not be negated
    flipped = df[['open', 'close', 'high', 'low', 'volume']].copy()
    flipped[['open', 'close']] = flipped[['open', 'close']].mul(multipliers, axis='rows')

    # groupby as usual
    # note that the string 'max' is vectorized, while the builtin max is not
    ret = (flipped
           .groupby([candles, g], sort=False)
           .agg({'open': 'max', 'close': 'min',
                 'high': 'max', 'low': 'min',
                 'volume': 'sum'})
    )

    # multiply the red candles' open/close back by `-1`
    # since we are working with a MultiIndex, we slice by the level values
    ret.loc[ret.index.get_level_values(0) == 'R', ['open', 'close']] *= -1
    

    Output for the sample data (notice the values in the 2nd R group):

                   open   close    high     low      volume
      candle_is                                            
    G 1          260.41  257.86  266.05  255.63  1047075436
    R 2          266.31  265.51  266.80  262.71   401716112
    G 3          336.06  266.51  347.35  265.39   814068344
    R 4          342.13  347.29  354.02  322.60  1153665809
    G 5          330.20  350.16  352.19  327.24   463334913
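
    Putting the pieces together, a self-contained sketch of the same approach on a tiny made-up frame (the numbers below are illustrative only, not from the question's data):

    import numpy as np
    import pandas as pd

    # tiny made-up frame: two green rows followed by two red rows
    df = pd.DataFrame({
        'open':   [100.0, 104.0, 110.0, 108.0],
        'high':   [105.0, 110.0, 111.0, 109.0],
        'low':    [ 99.0, 103.0, 107.0, 101.0],
        'close':  [104.0, 109.0, 108.0, 102.0],
        'volume': [10, 20, 30, 40],
    })

    candles = pd.Series(
        np.select([df['open'] > df['close'], df['open'] < df['close']], ['R', 'G'], 'N'),
        index=df.index)
    g = candles.ne(candles.shift()).cumsum()
    multipliers = np.where(candles == 'R', -1, 1)

    # flip only open/close, aggregate each run, then flip back for red runs
    flipped = df.copy()
    flipped[['open', 'close']] = flipped[['open', 'close']].mul(multipliers, axis=0)
    ret = (flipped.groupby([candles, g], sort=False)
                  .agg({'open': 'max', 'close': 'min',
                        'high': 'max', 'low': 'min',
                        'volume': 'sum'}))
    ret.loc[ret.index.get_level_values(0) == 'R', ['open', 'close']] *= -1
    print(ret)  # one combined row per run of same-coloured candles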