Search code examples
pythonpandaspivotmulti-index

pandas: how to run a pivot with a multi-index?


I would like to run a pivot on a pandas DataFrame, with the index being two columns, not one. For example, one field for the year, one for the month, an 'item' field which shows 'item 1' and 'item 2' and a 'value' field with numerical values. I want the index to be year + month.

The only way I managed to get this to work was to combine the two fields into one, then separate them again. is there a better way?

Minimal code copied below. Thanks a lot!

PS Yes, I am aware there are other questions with the keywords 'pivot' and 'multi-index', but I did not understand if/how they can help me with this question.

import pandas as pd
import numpy as np

df= pd.DataFrame()
month = np.arange(1, 13)
values1 = np.random.randint(0, 100, 12)
values2 = np.random.randint(200, 300, 12)


df['month'] = np.hstack((month, month))
df['year'] = 2004
df['value'] = np.hstack((values1, values2))
df['item'] = np.hstack((np.repeat('item 1', 12), np.repeat('item 2', 12)))

# This doesn't work: 
# ValueError: Wrong number of items passed 24, placement implies 2
# mypiv = df.pivot(['year', 'month'], 'item', 'value')

# This doesn't work, either:
# df.set_index(['year', 'month'], inplace=True)
# ValueError: cannot label index with a null key
# mypiv = df.pivot(columns='item', values='value')

# This below works but is not ideal: 
# I have to first concatenate then separate the fields I need
df['new field'] = df['year'] * 100 + df['month']

mypiv = df.pivot('new field', 'item', 'value').reset_index()
mypiv['year'] = mypiv['new field'].apply( lambda x: int(x) / 100)  
mypiv['month'] = mypiv['new field'] % 100

Solution

  • You can group and then unstack.

    >>> df.groupby(['year', 'month', 'item'])['value'].sum().unstack('item')
    item        item 1  item 2
    year month                
    2004 1          33     250
         2          44     224
         3          41     268
         4          29     232
         5          57     252
         6          61     255
         7          28     254
         8          15     229
         9          29     258
         10         49     207
         11         36     254
         12         23     209
    

    Or use pivot_table:

    >>> df.pivot_table(
            values='value', 
            index=['year', 'month'], 
            columns='item', 
            aggfunc=np.sum)
    item        item 1  item 2
    year month                
    2004 1          33     250
         2          44     224
         3          41     268
         4          29     232
         5          57     252
         6          61     255
         7          28     254
         8          15     229
         9          29     258
         10         49     207
         11         36     254
         12         23     209