Search code examples
pythongroup-by

Dataframe groupby and agg returns dataframe with empty columns but maintains index in VS Code?


Really scratching my head on why I keep getting an empty dataframe back when I try to do a groupby and agg on time series data.

    Timestamp   Price   Size        
57175   2023-08-25 05:30:02-05:00   4397.50 1       
57176   2023-08-25 05:30:02-05:00   4397.50 2       
57177   2023-08-25 05:30:02-05:00   4397.25 1       
57178   2023-08-25 05:30:02-05:00   4397.25 1       
57179   2023-08-25 05:30:02-05:00   4397.25 1       
... ... ... ... ... ...
1286370 2023-08-25 15:55:00-05:00   4414.25 5       
1286371 2023-08-25 15:55:00-05:00   4414.25 2       
1286372 2023-08-25 15:55:00-05:00   4414.25 2       
1286373 2023-08-25 15:55:00-05:00   4414.25 1       
1286374 2023-08-25 15:55:00-05:00   4414.25 1       

I have data like so, I try to run the following on it:

def bar(x, y):
    return np.int64(x / y) * y

vol_bars = df.groupby(bar(np.cumsum(df['Size']), traded_vol)).agg({'Price': 'ohlc', 'Size': 'sum'})
vol_bars

My notebook shows the result as:

    Price   Size
open    high    low close   Size
0   4397.50 4398.75 4397.25 4398.00 499
500 4398.00 4398.50 4397.50 4397.75 500
1000    4397.75 4399.00 4397.75 4398.75 500
1500    4398.75 4399.00 4396.50 4396.50 500
2000    4396.50 4398.25 4396.25 4398.00 500
... ... ... ... ... ...
1674500 4414.75 4414.75 4414.25 4414.50 498
1675000 4414.50 4414.75 4414.25 4414.25 482
1675500 4414.25 4414.75 4414.00 4414.50 517
1676000 4414.50 4414.50 4414.25 4414.25 491
1676500 4414.25 4414.25 4414.00 4414.25 151

Which is perfect but when I try to do anything with this new 'vol_bars' dataframe I get missing key errors. When I check the variable in VS code it displays like this:

Empty Frame?

Where am I going wrong?

Thank you


Solution

  • Regarding missing keys, your image has the answer: after aggregating, the labels for each column become MultiIndex, namely they're no longer single words but 2-tuple of words, e.g. ('Price', 'open'). Chances are that you're using the OHLC labels alone. Try giving things like ('Price', 'open') to the loc operator of vol_bars and you'll see your data.

    If this isn't the intended behavior, then modify the arguments of groupby accordingly. Alternatively, if you want to keep the OHLC prices with their own names, use "https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.droplevel.html" to drop the first level of your multi-indexes.

    Assuming you're getting the second table from the first one, you'll need to give us the definition of ohlc in order to help you further.