I have 1 minute ohlcv data in a pandas dataframe. I want to resample it with 5 minute intervals and see if the high of the 5 minutes was hit first, or the low, all the while keeping the ohlcv values.
Input is like
datetime | open | high | low | close |
---|---|---|---|---|
2022-01-01 10:00:00 | 10 | 12 | 9 | 11 |
2022-01-01 10:01:00 | 11 | 14 | 9 | 12 |
2022-01-01 10:02:00 | 12 | 12 | 8 | 10 |
2022-01-01 10:03:00 | 10 | 15 | 9 | 11 |
2022-01-01 10:04:00 | 10 | 12 | 8 | 11 |
2022-01-01 10:05:00 | 11 | 11 | 9 | 10 |
Output is like
datetime | open | high | low | close | high_first |
---|---|---|---|---|---|
2022-01-01 10:00:00 | 10 | 15 | 8 | 11 | 0 |
2022-01-01 10:05:00 | 11 | 11 | 9 | 10 | 1 |
First, I tried the simple way:
I would find the indexes of where high would reach its max and low would reach its min, then I would add another column comparing those.
df.resample("15min").agg({"high": ["idxmax", "max"], "low": ["idxmin", "min"]})
But I got an error:
ValueError: attempt to get argmax of an empty sequence
Because my data is not continuous (it cuts off at holidays)
So, left to my own devices, I wrote an apply function:
def transform(x):
x["open_first"] = x["open"][0]
x["close_last"] = x["close"][-1]
x["high_max"] = x["high"].max()
x["low_min"] = x["low"].min()
x["high_idxmax"] = x["high"].idxmax()
x["low_idxmin"] = x["low"].idxmin()
x["volume_sum"] = x["volume"].sum()
x["high_first"] = x["high_idxmax"] < x["low_idxmin"]
return x
But it is very slow. Is it possible to make it faster and avoid df.apply
?
You can use DatetimeIndex.floor
for 5 minutes values with not continuous values of DatetimeIndex
:
df1 = (df.groupby(df.index.floor('5min'))
.agg(open_first=('open','first'),
close_last=('close','last'),
high_max=('high','max'),
low_min=('low','min'),
high_idxmax=('high','idxmax'),
low_idxmin=('low','idxmin'))
.assign(high_first = lambda x: x["high_idxmax"] < x["low_idxmin"]))
print (df1)
open_first close_last high_max low_min \
datetime
2022-01-01 10:00:00 10 11 15 8
2022-01-01 10:05:00 11 10 11 9
high_idxmax low_idxmin high_first
datetime
2022-01-01 10:00:00 2022-01-01 10:03:00 2022-01-01 10:02:00 False
2022-01-01 10:05:00 2022-01-01 10:05:00 2022-01-01 10:05:00 False
Your solution create consecutive index, for idxmax/idmin
are necessary lambda functions:
df1 = (df.groupby(pd.Grouper(freq='5Min'))
.agg(open_first=('open','first'),
close_last=('close','last'),
high_max=('high','max'),
low_min=('low','min'),
high_idxmax=('high',lambda x : np.nan if x.count() == 0 else x.idxmax()),
low_idxmin=('low',lambda x : np.nan if x.count() == 0 else x.idxmin()))
.assign(high_first = lambda x: x["high_idxmax"] < x["low_idxmin"]))
print (df1)