Using pd.Grouper with a datetime key in conjunction with another key creates a set of groups, but this does not seem to encompass all of the groups that need to be created, in my opinion.
>>> test = pd.DataFrame({"id":["a","b"]*3, "b":pd.date_range("2000-01-01","2000-01-03", freq="9H")})
>>> test
id b
0 a 2000-01-01 00:00:00
1 b 2000-01-01 09:00:00
2 a 2000-01-01 18:00:00
3 b 2000-01-02 03:00:00
4 a 2000-01-02 12:00:00
5 b 2000-01-02 21:00:00
When I tried to create groups based on the date and id
values:
>>> g = test.groupby([pd.Grouper(key='b', freq="D"), 'id'])
>>> g.groups
{(2000-01-01 00:00:00, 'a'): [0], (2000-01-02 00:00:00, 'b'): [1]}
g.groups
shows only 2 groups when I expected 4 groups: both "a" and "b" for each day.
However, when I created another column based off of "b":
>>> test['date'] = test.b.dt.date
>>> g = test.groupby(['date', 'id'])
>>> g.groups
{(2000-01-01, 'a'): [0, 2], (2000-01-01, 'b'): [1], (2000-01-02, 'a'): [4], (2000-01-02, 'b'): [3, 5]}
The outcome was exactly what I expected.
I don't know how to make sense of these different outcomes. Please enlighten me.
You do have 4 groups with the Grouper
, the output of g.groups
is misleading (maybe worth reporting as a bug?):
g = test.groupby([pd.Grouper(key='b', freq="D"), 'id'])
g.ngroups
# 4
g.size()
# b id
# 2000-01-01 a 2
# b 1
# 2000-01-02 a 1
# b 2
# dtype: int64