I have a data frame that looks like this...
name datetime value
1 Michael 2012-01-11 01:11:19 3999
3 Janus 2014-01-11 00:01:10 1
4 Janus 2014-01-11 01:12:29 81
5 Janus 2014-01-11 01:11:19 8047
... ... ... ...
9001230 Excalibur1 2014-12-25 01:11:10 1
9001231 Excalibur1 2014-12-25 02:11:10 2
9001232 Excalibur1 2013-12-25 03:11:10 3
9001233 Excalibur1 2014-12-25 04:11:10 4
9001234 Excalibur1 2014-12-25 05:11:10 907
I want to find, for each name, how many the total value each day.
In days with no values listed, I want these days to still exist, and I want to fill zeros.
I can do the following, which seems to only sometimes zero fill:
df.set_index(["datetime"]).groupby("name").resample("D")["value"].sum()
name datetime
Michael 2012-01-11 3999
Janus 2014-01-11 8129
2014-10-12 0
2014-10-13 0
2014-10-14 0
...
Excalibur1 2014-12-21 0
2014-12-22 1
2014-12-23 0
2014-12-24 0
2014-12-25 917
Why is it /sometimes/ filling with zeros and what's the right way to rectangularize this so I have all dates summed and filled with zero where no records exist?
The index I would like is composed of...
index=pd.MultiIndex.from_product([
pd.date_range(
df["datetime"].dt.date.min(),
df["datetime"].dt.date.max(), freq="D"),
sorted(df["name"].unique()
)
And dense.
Don't resample
, but use the date in the groupby
:
df['datetime'] = pd.to_datetime(df['datetime'])
df.groupby(['name', df['datetime'].dt.date]).sum()
Or, using pandas.Grouper
for flexibility:
df.groupby(['name', pd.Grouper(key='datetime', freq='D')]).sum()
Output:
value
name datetime
Excalibur1 2013-12-25 3
2014-12-25 914
Janus 2014-01-11 8129
Michael 2012-01-11 3999
For a rectangular shape use:
df2 = df.groupby(['name', pd.Grouper(key='datetime', freq='D')])['value'].sum().unstack(level='name', fill_value=0)
Output:
name Excalibur1 Janus Michael
datetime
2013-12-25 3 0 0
2014-12-25 914 0 0
2014-01-11 0 8129 0
2012-01-11 0 0 3999
And to add missing dates, reindex
:
df2 = df.groupby(['name', pd.Grouper(key='datetime', freq='D')])['value'].sum().unstack(level='name', fill_value=0)
df2 = df2.reindex(pd.date_range(df['datetime'].dt.date.min(), df['datetime'].max()), fill_value=0)
Output:
name Excalibur1 Janus Michael
2012-01-11 0 0 3999
2012-01-12 0 0 0
2012-01-13 0 0 0
2012-01-14 0 0 0
2012-01-15 0 0 0
...