I'm trying to "combine" consecutive rows of like data where certain conditions match, and everything I've tried is throwing errors or putting the data together in an unexpected way.
Data:
open high low close volume datetime
0 257.31 259.04 255.63 257.86 335889185 1510552800000
1 258.14 260.48 257.86 260.36 190142219 1511157600000
2 260.41 266.05 260.00 264.46 521044032 1511762400000
3 266.31 266.80 262.71 265.51 401716112 1512367200000
4 265.58 267.56 265.39 266.51 516455674 1512972000000
.. ... ... ... ... ... ...
151 336.06 347.35 334.38 346.85 297612670 1601874000000
152 349.59 354.02 343.13 347.29 361462322 1602478800000
153 348.65 349.33 340.65 345.78 296595696 1603083600000
154 342.13 342.98 322.60 326.54 495607791 1603688400000
155 330.20 352.19 327.24 350.16 463334913 1604296800000
I want to combine consecutive rows where open > close (red) with each other, and likewise consecutive rows where close > open (green), so that I end up with one large candle (this is stock data) for each run of same-type candles.
Initially I started by making a column to signify which type of row it is (probably not required; the comparison could likely be done inline during the row merges?):
def green_or_red(row):
    if row['open'] > row['close']:
        val = 'R'
    elif row['open'] < row['close']:
        val = 'G'
    else:
        val = 'N'
    return val

df['candle_is'] = df.apply(green_or_red, axis=1)
This assigns the labels correctly; however, merging the consecutive rows is where I run into issues:
# merge the consecutive same types of candles
g = df['candle_is'].ne(df['candle_is'].shift()).cumsum()
dfn = df.groupby(['candle_is', g], sort=False).agg({'open': max, 'close': min, 'high': max, 'low': min, 'volume': sum})
produces:
open close high low volume
candle_is candle_is
G 1 260.41 257.86 266.0500 255.63 1047075436
R 2 266.31 265.51 266.8000 262.71 401716112
G 3 265.58 266.51 267.5600 265.39 516455674
R 4 268.10 266.86 268.6000 266.64 632660142
G 5 280.17 273.42 286.6285 267.40 1655227273
... ... ... ... ... ...
G 73 342.12 326.52 350.7200 319.64 1280999271
R 74 350.35 330.65 358.7500 327.97 1257122392
G 75 336.06 328.73 347.3500 319.80 1099865805
R 76 349.59 326.54 354.0200 322.60 1153665809
G 77 330.20 350.16 352.1900 327.24 463334913
But I need to separate the logic between red (R) and green (G) candles so that the agg() works a bit differently, because the min/max applied to open/close should be swapped for each type:
# green
df.groupby(['candle_is', g], sort=False).agg({'open': max, 'close': min, 'high': max, 'low': min, 'volume': sum})
# red
df.groupby(['candle_is', g], sort=False).agg({'open': min, 'close': max, 'high': max, 'low': min, 'volume': sum})
However, I can't find a way to use g or df['candle_is'] == 'G' to target those groups specifically without getting a slew of errors, because once I filter the data, the sizes no longer match. How can this be accomplished sanely? Thanks!
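For reference, the ne/shift/cumsum grouper built above can be seen in isolation on a small made-up series (toy data, not the asker's):

```python
import pandas as pd

# A new group starts wherever the value differs from the previous row;
# cumsum turns those change points into incrementing group ids.
s = pd.Series(['G', 'G', 'R', 'R', 'R', 'G'])
g = s.ne(s.shift()).cumsum()
print(g.tolist())  # [1, 1, 2, 2, 2, 3]
```

Grouping by these ids merges each run of consecutive identical labels into one group.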
If you are seeking to swap your min/max, it might be easier to notice that max(-array) = -min(array). So we can just multiply the data by -1 and multiply back afterwards:
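As a quick standalone check of that identity (arbitrary toy array):

```python
import numpy as np

# max of the negated array equals the negated min of the original
x = np.array([3.0, -1.0, 7.0, 2.0])
print(np.max(-x))   # 1.0
print(-np.min(x))   # 1.0
```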
# use this instead of `apply`, which is not vectorized
candles = np.select([df['open'] > df['close'], df['open'] < df['close']],
                    ['R', 'G'], 'N')
# turn candles into a Series so it can act as a grouper
candles = pd.Series(candles, index=df.index, name='candle_is')
g = candles.ne(candles.shift()).cumsum()

# flip the sign of `open`/`close` on red candles so min becomes max and
# vice versa; `high`/`low` are left alone, since `high: max` and
# `low: min` are the same for both candle types
signed = df.copy()
signed.loc[candles == 'R', ['open', 'close']] *= -1

# groupby as usual
# note that the string `'max'` uses pandas' optimized path,
# while the builtin `max` does not
ret = (signed.groupby([candles, g], sort=False)
             .agg({'open': 'max', 'close': 'min',
                   'high': 'max', 'low': 'min',
                   'volume': 'sum'})
)
# undo the sign flip on the red candles' open/close
# since we are working with a MultiIndex, we slice by the level values
ret.loc[ret.index.get_level_values(0) == 'R', ['open', 'close']] *= -1
Output for the sample data (notice the values in the 2nd R group, where open is now the min and close the max):
open close high low volume
candle_is
G 1 260.41 257.86 266.05 255.63 1047075436
R 2 266.31 265.51 266.80 262.71 401716112
G 3 336.06 266.51 347.35 265.39 814068344
R 4 342.13 347.29 354.02 322.60 1153665809
G 5 330.20 350.16 352.19 327.24 463334913
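Putting it all together on a tiny made-up frame (the numbers below are illustrative only, not the asker's data):

```python
import numpy as np
import pandas as pd

# Hypothetical candles: two greens followed by two reds.
df = pd.DataFrame({
    'open':   [10.0, 12.0, 20.0, 18.0],
    'close':  [12.0, 14.0, 18.0, 15.0],
    'high':   [13.0, 15.0, 21.0, 19.0],
    'low':    [ 9.0, 11.0, 17.0, 14.0],
    'volume': [100,  200,  300,  400],
})

# Classify each row, then label runs of consecutive equal labels.
candles = pd.Series(
    np.select([df['open'] > df['close'], df['open'] < df['close']],
              ['R', 'G'], 'N'),
    index=df.index)
g = candles.ne(candles.shift()).cumsum()

# Negate only open/close on red rows so one agg spec serves both types.
signed = df.copy()
signed.loc[candles == 'R', ['open', 'close']] *= -1

out = (signed.groupby([candles, g], sort=False)
             .agg({'open': 'max', 'close': 'min',
                   'high': 'max', 'low': 'min', 'volume': 'sum'}))
# Undo the sign flip on the red groups' open/close.
out.loc[out.index.get_level_values(0) == 'R', ['open', 'close']] *= -1

# R group: open = min(20, 18) = 18, close = max(18, 15) = 18,
#          high = max(21, 19) = 21, low = min(17, 14) = 14
print(out)
```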