Search code examples
pythonpandasdatetimegroup-by

pd.Grouper with datetime key in conjunction with another grouping key seemingly creates the wrong number of groups


Using pd.Grouper with a datetime key in conjunction with another key creates a set of groups, but this does not seem to encompass all of the groups that need to be created, in my opinion.

>>> test = pd.DataFrame({"id":["a","b"]*3, "b":pd.date_range("2000-01-01","2000-01-03", freq="9H")})
>>> test
    id  b
0   a   2000-01-01 00:00:00
1   b   2000-01-01 09:00:00
2   a   2000-01-01 18:00:00
3   b   2000-01-02 03:00:00
4   a   2000-01-02 12:00:00
5   b   2000-01-02 21:00:00

When I tried to create groups based on the date and id values:

>>> g = test.groupby([pd.Grouper(key='b', freq="D"), 'id'])
>>> g.groups
{(2000-01-01 00:00:00, 'a'): [0], (2000-01-02 00:00:00, 'b'): [1]}

g.groups shows only 2 groups when I expected 4 groups: both "a" and "b" for each day.

However, when I created another column based off of "b":

>>> test['date'] = test.b.dt.date
>>> g = test.groupby(['date', 'id'])
>>> g.groups
{(2000-01-01, 'a'): [0, 2], (2000-01-01, 'b'): [1], (2000-01-02, 'a'): [4], (2000-01-02, 'b'): [3, 5]}

The outcome was exactly what I expected.

I don't know how to make sense of these different outcomes. Please enlighten me.


Solution

  • You do have 4 groups with the Grouper, the output of g.groups is misleading (maybe worth reporting as a bug?):

    g = test.groupby([pd.Grouper(key='b', freq="D"), 'id'])
    
    g.ngroups
    # 4
    
    g.size()
    # b           id
    # 2000-01-01  a     2
    #             b     1
    # 2000-01-02  a     1
    #             b     2
    # dtype: int64