Grouping Period series values in Pandas


Following on from "Reading CSV file in Pandas with historical dates", I have some CSV data in the form:

Object,Earliest Date
Object1,01/01/2000
Object2,01/01/1760
Object3,01/01/1520
...

which I've now read into Pandas (using Period to handle historical dates) to create a series. I'm trying to bin the series into decades, but I'm stumbling on getting the Period values into the form groupby expects. So far I've tried the following (where s is the data read with from_csv):

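import pandas as pd

def dt_parse(text):
    # Parse a 'dd/mm/yyyy' string into a daily Period; return NaT if it is malformed
    try:
        d, m, y = text.split('/')
        return pd.Period(year=int(y), month=int(m), day=int(d), freq='D')
    except (AttributeError, ValueError):
        return pd.NaT

s2 = s['Earliest Date'].apply(dt_parse)  # Create Period values
pi = pd.PeriodIndex(s2)
decades = pi.groupby(pd.Grouper(freq="120M")).count()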

which fails with:

 TypeError: Argument 'labels' has incorrect type (expected numpy.ndarray, got TimeGrouper)

Trying to group on it as a series:

 decades = s2.groupby(pd.Grouper(freq="120M")).count()

fails with:

 TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Index'

Trying to group on it as a DataFrame:

df = pd.DataFrame(s2)
decades = df.groupby(pd.Grouper(freq="120M", key='Earliest Date')).size()

fails with:

AttributeError: 'Index' object has no attribute 'to_timestamp'

I'm not sure how else to do it.


Solution

  • The error messages and pandas documentation will be your friends here.

    I have no idea whether your date column contains strictly unique dates. If it does, it's trivial: use that column as the index and you can use pd.Grouper. Otherwise, define your own grouping function:

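    def grouper(ind):
        # groupby calls this once per index label; look up that row's year
        y = df.loc[ind, 'Earliest Date'].year
        return y - (y % 10)  # floor to the start of the decade, e.g. 1765 -> 1760

    # I'm assuming that df is the dataframe from pd.read_csv("/path/to/csv"),
    # that its index labels are unique, and that there's a column named
    # 'Earliest Date' that is a Period or Datetime or something with a year attribute
    gb = df.groupby(by=grouper)
    print(gb.size())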
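
For reference, here is a minimal end-to-end sketch of the same decade-flooring idea that does not depend on the index at all. The sample rows, the fourth and fifth objects, and the helper names to_period and decade_of are made up for illustration and are not from the question; the sketch assumes the dates are 'dd/mm/yyyy' strings, as in the CSV shown above.

```python
import pandas as pd

# Hypothetical sample data shaped like the CSV in the question
df = pd.DataFrame({
    "Object": ["Object1", "Object2", "Object3", "Object4", "Object5"],
    "Earliest Date": ["01/01/2000", "01/01/1760", "01/01/1520",
                      "15/06/1765", "unknown"],
})

def to_period(text):
    """Parse 'dd/mm/yyyy' into a daily Period; return NaT for bad input."""
    try:
        day, month, year = (int(part) for part in text.split("/"))
        return pd.Period(year=year, month=month, day=day, freq="D")
    except (AttributeError, ValueError):
        return pd.NaT

def decade_of(period):
    """Return the first year of the period's decade, or None for NaT."""
    if pd.isna(period):
        return None
    return period.year - period.year % 10

periods = df["Earliest Date"].apply(to_period)

# Drop rows that failed to parse, then count how many fall in each decade
decades = periods.map(decade_of).dropna().astype(int)
counts = decades.groupby(decades).size()
print(counts)
```

Grouping on a precomputed decade value sidesteps pd.Grouper entirely, so it should not matter whether the dates are unique or whether they sit in the index. Rows that cannot be parsed are simply left out of the counts.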