Search code examples
pythonpandasgroup-bypd.grouper

Using pd.groupby() with pd.Grouper() consisting of date shows fewer groups


In pd.groupby() using a pd.Grouper() along with a column fruit is showing less number of groups shown in #4 below. There should have been fruit with other dates because those are there in the final output in #5.

For example there is a (2020-01-01 00:00:00, 'mango') group but no (2020-01-01 00:00:00, 'orange') group in #4 etc. May be I am missing something. Thank you for your help.

Code is below:

# Library
import pandas as pd

# Data
date = [pd.Timestamp('01/01/2020'),
          pd.Timestamp('01/03/2020'),
          pd.Timestamp('01/20/2020'),
          pd.Timestamp('09/01/2020'),
          pd.Timestamp('09/03/2020'),
          pd.Timestamp('09/20/2020'),                                     
          pd.Timestamp('12/01/2020'),
          pd.Timestamp('12/03/2020'),
          pd.Timestamp('12/20/2020')
       ]

df = pd.DataFrame({
    'fruits': ['mango','mango','orange','orange','banana', 'mango', 'orange','banana', 'banana'],
    'price': [10,12,7,9,3,1,2,11,13],
    'date': date
})


# Grouper
# 1MS: month start frequency
p = pd.Grouper(freq='1MS', key='date')
print("#1-\n", p, '\n')

g = df.groupby(['fruits'])
print("#2-\n", g.groups, '\n')

g = df.groupby([p])
print("#3-\n", g.groups, '\n')

g = df.groupby([p, 'fruits'])
print("#4-\n", g.groups, '\n')

result = g.sum()
print("\n\n#5- result:\n", result)

Output:

#1-
 TimeGrouper(key='date', freq=<MonthBegin>, axis=0, sort=True, dropna=True, closed='left', label='left', how='mean', convention='e', origin='start_day') 

#2-
 {'banana': [4, 7, 8], 'mango': [0, 1, 5], 'orange': [2, 3, 6]} 

#3-
 {2020-01-01 00:00:00: [0, 1, 2], 2020-02-01 00:00:00: [], 2020-03-01 00:00:00: [], 2020-04-01 00:00:00: [], 2020-05-01 00:00:00: [], 2020-06-01 00:00:00: [], 2020-07-01 00:00:00: [], 2020-08-01 00:00:00: [], 2020-09-01 00:00:00: [3, 4, 5], 2020-10-01 00:00:00: [], 2020-11-01 00:00:00: [], 2020-12-01 00:00:00: [6, 7, 8]} 

#4-
 {(2020-01-01 00:00:00, 'mango'): [0], (2020-09-01 00:00:00, 'mango'): [1], (2020-12-01 00:00:00, 'orange'): [2]} 



#5- result:
                    price
date       fruits       
2020-01-01 mango      22
           orange      7
2020-09-01 banana      3
           mango       1
           orange      9
2020-12-01 banana     24
           orange      2

Solution

  • You found bug, already reported - BUG: pd.Grouper with a datetime key in conjunction with another key generates incorrect number of group keys. #51158