Here is the test data:
import numpy as np
import pandas as pd
# multi-indexed dataframe via cartesian join
df1 = pd.DataFrame([1, 2, 3])
df2 = pd.DataFrame(pd.date_range(start='2016', end='2018', freq='M'))
df1['key'] = 0
df2['key'] = 0
df = df1.merge(df2, how='outer', on='key')
del df1, df2
del df['key']
df.columns = ['id','date']
df['value'] = np.random.randn(len(df))
df.set_index(['date', 'id'], inplace=True)
df.sort_index(inplace=True)
df.head()
output:
value
date id
2016-01-31 1 0.245029
2 -2.141292
3 1.521566
2016-02-29 1 0.870639
2 1.407977
There is probably a better way to generate the cartesian join, but I'm new to pandas and this is the best I could find to generate panel data that looks like mine. Anyway, I want to check the pattern of observations over time to see if any are missing.
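As an aside, newer pandas versions (1.2+) support a built-in cross join, so the dummy key column isn't needed; a minimal sketch under that version assumption:

```python
import pandas as pd

# cross join (pandas >= 1.2): every row of df1 is paired with every row
# of df2, with no dummy 'key' column to create and delete
df1 = pd.DataFrame({'id': [1, 2, 3]})
df2 = pd.DataFrame({'date': pd.date_range(start='2016', end='2018', freq='M')})
df = df1.merge(df2, how='cross')  # 3 ids x 24 month-ends = 72 rows
```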
My goal is to create a year-by-month table of observation frequencies. This is close to what I want:
df.groupby(pd.Grouper(level='date',freq='M')).count()
But it gives a vertical list. My data is much bigger than this small MWE so I'd like to fit it more compactly, as well as see if there are seasonal patterns (i.e. lots of observations in December or June).
It seems to me that this should work but it doesn't:
df.groupby([df.index.levels[0].month, df.index.levels[0].year]).count()
This fails with a ValueError: Grouper and axis must be same length.
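The length mismatch is because df.index.levels[0] holds only the unique values of the level, while groupby needs one label per row. A small demonstration, rebuilding the frame from the setup above:

```python
import numpy as np
import pandas as pd

# rebuild the example frame: 3 ids x 24 month-ends = 72 rows
df1 = pd.DataFrame({'id': [1, 2, 3], 'key': 0})
df2 = pd.DataFrame({'date': pd.date_range('2016', '2018', freq='M'), 'key': 0})
df = df1.merge(df2, on='key').drop(columns='key')
df['value'] = np.random.randn(len(df))
df = df.set_index(['date', 'id']).sort_index()

# levels[0] is the de-duplicated level: too short to align with the rows
print(len(df.index.levels[0]))            # 24 unique dates
# get_level_values(0) repeats the label for every row: same length as df
print(len(df.index.get_level_values(0)))  # 72
```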
This gives what I'm looking for but it seems to me that it should be easier with the time index:
df.reset_index(inplace=True)
df['month'] = df['date'].dt.month
df['year'] = df['date'].dt.year
df.groupby(['month', 'year'])['value'].count().unstack().T
output:
month 1 2 3 4 5 6 7 8 9 10 11 12
year
2016 3 3 3 3 3 3 3 3 3 3 3 3
2017 3 3 3 3 3 3 3 3 3 3 3 3
Also, since this is just a quick validation, I'd rather not reset the index, then re-establish the index (and delete month and year) each time just to see this table.
I think you need Index.get_level_values to select the first level of the MultiIndex:
idx = df.index.get_level_values(0)
df1 = df.groupby([idx.year, idx.month])['value'].count().unstack()
Or:
df1 = df.groupby([idx.year, idx.month]).size().unstack()
The difference between count and size is that count omits NaNs while size does not.
print(df1)
date 1 2 3 4 5 6 7 8 9 10 11 12
date
2016 3 3 3 3 3 3 3 3 3 3 3 3
2017 3 3 3 3 3 3 3 3 3 3 3 3
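A minimal illustration of that count-versus-size difference, on a toy frame (not the question's data) with one missing value in group 'a':

```python
import numpy as np
import pandas as pd

tmp = pd.DataFrame({'g': ['a', 'a', 'b'], 'value': [1.0, np.nan, 2.0]})

cnt = tmp.groupby('g')['value'].count()  # NaN is dropped: a -> 1, b -> 1
sz = tmp.groupby('g').size()             # every row counts: a -> 2, b -> 1
print(cnt)
print(sz)
```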