Search code examples
pythonpandasmulti-index

Get Frequency table from multi index dates in pandas


Here is test data

import numpy as np
import pandas as pd
import datetime

# multi-indexed dataframe via cartesian join
df1 = pd.DataFrame([1, 2, 3])
df2 = pd.DataFrame(pd.date_range(start='2016', end='2018', freq='M'))
df1['key'] = 0
df2['key'] = 0
df = df1.merge(df2, how='outer', on='key')
del df1, df2
del df['key']
df.columns = ['id','date']

df['value'] = pd.DataFrame(np.random.randn(len(df)))

df.set_index(['date', 'id'], inplace=True)
df.sort_index(inplace=True)
df.head()

output:

                  value
date       id
2016-01-31 1   0.245029
           2  -2.141292
           3   1.521566
2016-02-29 1   0.870639
           2   1.407977

There is probably a better way to generate the cartesian join, but I'm new and that is the best I could find to generate panel data that looks like mine. Anyway, my goal is to create a quick table looking at the pattern of observations to see if any are missing as it relates to time.

My goal is to create a year by month table of frequency observations. This is close to what I want:

df.groupby(pd.Grouper(level='date',freq='M')).count()

But it gives a vertical list. My data is much bigger than this small MWE so I'd like to fit it more compactly, as well as see if there are seasonal patterns (i.e. lots of observations in December or June).

It seems to me that this should work but it doesn't:

df.groupby([df.index.levels[0].month, df.index.levels[0].year]).count()

I get a ValueError: Grouper and axis must be same length error.

This gives what I'm looking for but it seems to me that it should be easier with the time index:

df.reset_index(inplace=True)
df['month'] = df['date'].dt.month
df['year'] = df['date'].dt.year
df.groupby(['month', 'year'])['value'].count().unstack().T

output:

month  1   2   3   4   5   6   7   8   9   10  11  12
year
2016    3   3   3   3   3   3   3   3   3   3   3   3
2017    3   3   3   3   3   3   3   3   3   3   3   3

Also, since this is just a quick validation, I'd rather not reset the index, then re-establish the index (and delete month and year) each time just to see this table.


Solution

  • I think need Index.get_level_values for select first level of MultiIndex:

    idx = df.index.get_level_values(0)
    df1 = df.groupby([idx.year, idx.month])['value'].count().unstack()
    

    Or:

    df1 = df.groupby([idx.year, idx.month]).size().unstack()
    

    Difference between count and size is count omit NaNs and size not.


    print (df1)
    date  1   2   3   4   5   6   7   8   9   10  11  12
    date                                                
    2016   3   3   3   3   3   3   3   3   3   3   3   3
    2017   3   3   3   3   3   3   3   3   3   3   3   3