Really scratching my head on why I keep getting an empty dataframe back when I try to do a groupby and agg on time series data.
Timestamp Price Size
57175 2023-08-25 05:30:02-05:00 4397.50 1
57176 2023-08-25 05:30:02-05:00 4397.50 2
57177 2023-08-25 05:30:02-05:00 4397.25 1
57178 2023-08-25 05:30:02-05:00 4397.25 1
57179 2023-08-25 05:30:02-05:00 4397.25 1
... ... ... ... ... ...
1286370 2023-08-25 15:55:00-05:00 4414.25 5
1286371 2023-08-25 15:55:00-05:00 4414.25 2
1286372 2023-08-25 15:55:00-05:00 4414.25 2
1286373 2023-08-25 15:55:00-05:00 4414.25 1
1286374 2023-08-25 15:55:00-05:00 4414.25 1
I have data like so, I try to run the following on it:
def bar(x, y):
return np.int64(x / y) * y
vol_bars = df.groupby(bar(np.cumsum(df['Size']), traded_vol)).agg({'Price': 'ohlc', 'Size': 'sum'})
vol_bars
My notebook shows the result as:
Price Size
open high low close Size
0 4397.50 4398.75 4397.25 4398.00 499
500 4398.00 4398.50 4397.50 4397.75 500
1000 4397.75 4399.00 4397.75 4398.75 500
1500 4398.75 4399.00 4396.50 4396.50 500
2000 4396.50 4398.25 4396.25 4398.00 500
... ... ... ... ... ...
1674500 4414.75 4414.75 4414.25 4414.50 498
1675000 4414.50 4414.75 4414.25 4414.25 482
1675500 4414.25 4414.75 4414.00 4414.50 517
1676000 4414.50 4414.50 4414.25 4414.25 491
1676500 4414.25 4414.25 4414.00 4414.25 151
Which is perfect but when I try to do anything with this new 'vol_bars' dataframe I get missing key errors. When I check the variable in VS code it displays like this:
Where am I going wrong?
Thank you
Regarding missing keys, your image has the answer: after aggregating, the labels for each column become MultiIndex
, namely they're no longer single words but 2-tuple of words, e.g. ('Price', 'open')
. Chances are that you're using the OHLC labels alone. Try giving things like ('Price', 'open')
to the loc operator of vol_bars
and you'll see your data.
If this isn't the intended behavior, then modify the arguments of groupby
accordingly. Alternatively, if you want to keep the OHLC prices with their own names, use "https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.droplevel.html" to drop the first level of your multi-indexes.
Assuming you're getting the second table from the first one, you'll need to give us the definition of ohlc
in order to help you further.