
Pandas Rearranging Multiindex Series


On pandas version 0.19.2, I have the Series below with a MultiIndex:

import pandas as pd
import numpy as np

arrays = [[2001, 2001, 2002,  2002, 2002, 2003, 2004, 2004],
          ['A', 'B', 'A', 'C', 'D', 'B', 'C', 'D']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
s = pd.Series(np.random.randn(8), index=index, name='signal')

It looks as follows:

first  second
2001   A        -2.48
       B         0.95
2002   A         0.55
       C         0.65
       D        -1.32
2003   B        -0.25
2004   C         0.86
       D        -0.31

I want to get a summary contingency dataframe where columns are unique "second" and indices are the "first" index, like below:

          A      B      C      D
2001  -2.48   0.95    NaN    NaN
2002   0.55    NaN   0.65  -1.32
2003    NaN  -0.25    NaN    NaN
2004    NaN    NaN   0.86  -0.31

Any idea how this can be done? I played around with groupby() as below, but could not get anywhere:

s.groupby(level=1).apply(lambda x: "to do")

Linked question: Python Pandas - how to do group by on a multiindex


Solution

  • I believe you need unstack, which works if the pairs in the MultiIndex are unique:

    df = s.unstack()
    print (df)
    second         A         B         C         D
    first                                         
    2001    1.752237  0.348548       NaN       NaN
    2002   -0.022903       NaN -0.961702  0.079236
    2003         NaN -0.393272       NaN       NaN
    2004         NaN       NaN -0.600994 -0.594842
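    The same reshape can also be written by moving the index levels into ordinary columns and pivoting; a small self-contained sketch (the sample values here are arbitrary, chosen so the result is easy to check):

    ```python
    import pandas as pd
    import numpy as np

    arrays = [[2001, 2001, 2002, 2002, 2002, 2003, 2004, 2004],
              ['A', 'B', 'A', 'C', 'D', 'B', 'C', 'D']]
    index = pd.MultiIndex.from_tuples(list(zip(*arrays)),
                                      names=['first', 'second'])
    s = pd.Series(np.arange(8.0), index=index, name='signal')

    # unstack() pivots the innermost index level ('second') into columns
    df = s.unstack()

    # equivalent: turn the index levels into columns, then pivot
    df2 = s.reset_index().pivot(index='first', columns='second',
                                values='signal')
    ```

    Both produce a frame indexed by 'first' with one column per unique 'second' label, with NaN where a pair is missing.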
    

    But if with real data you get:

    ValueError: Index contains duplicate entries, cannot reshape

    it means there are duplicate entries in the MultiIndex, like:

    print (s)
    first  second
    2001   A         0.478052 <-2001, A
           A         0.485261 <-2001, A
    2002   A        -0.474997
           C        -1.165866
           D        -0.755630
    2003   B         0.588104
    2004   C        -1.439245
           D        -0.461221
    Name: signal, dtype: float64
    
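    To check whether duplicate (first, second) pairs are what triggers the ValueError, the index itself can be inspected; a minimal sketch with deliberately duplicated labels:

    ```python
    import pandas as pd

    index = pd.MultiIndex.from_tuples(
        [(2001, 'A'), (2001, 'A'), (2002, 'C')],  # (2001, 'A') appears twice
        names=['first', 'second'])
    s = pd.Series([0.478052, 0.485261, -1.165866], index=index, name='signal')

    # True for every entry after the first occurrence of a repeated pair
    dupes = s.index.duplicated()
    has_dupes = not s.index.is_unique
    ```

    If has_dupes is True, unstack() will raise, and the values need aggregating first.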

    Then a possible solution is to aggregate the values first:

    print (s.groupby(level=[0,1]).mean())
    first  second
    2001   A         0.481657
    2002   A        -0.474997
           C        -1.165866
           D        -0.755630
    2003   B         0.588104
    2004   C        -1.439245
           D        -0.461221
    Name: signal, dtype: float64
    
    df = s.groupby(level=[0,1]).mean().unstack()
    print (df)
    second         A         B         C         D
    first                                         
    2001    0.481657       NaN       NaN       NaN
    2002   -0.474997       NaN -1.165866 -0.755630
    2003         NaN  0.588104       NaN       NaN
    2004         NaN       NaN -1.439245 -0.461221
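
    The aggregate-then-unstack steps can also be collapsed into a single call with pivot_table, which averages duplicate pairs by default; a hedged sketch on toy data (values chosen so the mean is easy to verify):

    ```python
    import pandas as pd

    index = pd.MultiIndex.from_tuples(
        [(2001, 'A'), (2001, 'A'), (2002, 'C')],  # duplicate (2001, 'A')
        names=['first', 'second'])
    s = pd.Series([1.0, 3.0, 5.0], index=index, name='signal')

    # pivot_table aggregates duplicate (first, second) pairs (mean by default)
    df = s.reset_index().pivot_table(index='first', columns='second',
                                     values='signal', aggfunc='mean')
    ```

    Here the two (2001, 'A') values 1.0 and 3.0 are averaged to 2.0, matching the explicit groupby(...).mean().unstack() approach.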