Following on from "Reading CSV file in Pandas with historical dates", I have some CSV data in the form:
Object,Earliest Date
Object1,01/01/2000
Object2,01/01/1760
Object3,01/01/1520
...
which I've now read into Pandas (using Period to handle historical dates) and used to create a series. I'm trying to bin the series into decades, but I'm stumbling on getting the Period values into the form groupby expects. So far I've tried the following (where s is the series created with from_csv):
def dt_parse(s):
    try:
        d,m,y = s.split('/')
        return pd.Period(year=int(y), month=int(m), day=int(d), freq='D')
    except:
        return pd.NaT
s2 = s['Earliest Date'].apply(dt_parse) #Create Period values
pi = pd.PeriodIndex(s2)
decades = pi.groupby(pd.Grouper(freq="120M")).count()
which fails with:
TypeError: Argument 'labels' has incorrect type (expected numpy.ndarray, got TimeGrouper)
Trying to group on it as a series:
decades = s2.groupby(pd.Grouper(freq="120M")).count()
fails with:
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Index'
Trying to group on it as a DataFrame:
df = pd.DataFrame(s2)
decades = df.groupby(pd.Grouper(freq="120M", key='Earliest Date')).size()
fails with:
AttributeError: 'Index' object has no attribute 'to_timestamp'
I'm not sure how else to do it.
The error messages and pandas documentation will be your friends here.
I can't tell whether your date column contains strictly unique dates. If it does, it's trivial: just use it as the index and you can use pd.Grouper. Otherwise, define your own grouping function:
def grouper(ind):
    y = df.loc[ind]['Earliest Date'].year
    return y - (y % 10)
# I'm assuming that df is the dataframe from pd.read_csv("/path/to/csv")
# and that there's a column named 'Earliest Date'
# whose values are Period or datetime objects, or anything else with a year attribute
gb = df.groupby(by=grouper)
print(gb.size())
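For a version you can run end to end, here is a minimal sketch of the same decade-flooring idea. The inline sample data (including the extra Object4 and Object5 rows) and the names parse_period, valid, decades and counts are my own assumptions for illustration, not anything from your code. It drops unparseable dates first, computes the decade from each Period's year, and groups on that.

```python
import io

import pandas as pd

# Hypothetical sample data in the same shape as the CSV in the question.
CSV_TEXT = """Object,Earliest Date
Object1,01/01/2000
Object2,01/01/1760
Object3,01/01/1520
Object4,15/06/1765
Object5,not a date
"""


def parse_period(text):
    """Parse DD/MM/YYYY into a daily Period; return NaT if parsing fails."""
    try:
        day, month, year = (int(part) for part in text.split("/"))
        return pd.Period(year=year, month=month, day=day, freq="D")
    except (ValueError, AttributeError):
        return pd.NaT


df = pd.read_csv(io.StringIO(CSV_TEXT))

# Convert the strings to Period values and discard rows that failed to parse.
valid = df["Earliest Date"].apply(parse_period).dropna()

# Floor each year to the start of its decade, e.g. 1765 -> 1760.
decades = valid.map(lambda period: period.year - period.year % 10)

# Group on the decade labels and count the rows in each bin.
counts = valid.groupby(decades).size()
print(counts)
```

Grouping on a precomputed series of decade labels avoids pd.Grouper entirely, so it should not matter whether the dates are unique or whether they sit in the index.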