Search code examples
python-3.xpandasmulti-index

How to Convert a Normal Dataframe into MultiIndex'd based on certain condition


After a long While I visited to the SO's pandas section and got a question which is not indeed nicely framed thus thought to put here in an explicit way as similar kind of situation I'm too as well :-)

Below is the data frame construct:

>>> df
       measure      Pend Job       Run Job       Time
cls
ABC  [inter, batch]     [101, 93]   [302, 1327]  [56, 131]
DEF  [inter, batch]  [24279, 421]  [4935, 5452]  [75, 300]

Desired output would be...

I tried working hard but didn't get any solution, thus though to Sketch it here as that's somewhat I would like it be achieved.

----------------------------------------------------------------------------------
    |                 |Pend Job     |       Run Job       |     Time             |
cls | measure         |-----------------------------------------------------------
    |                 |inter | batch|       |inter | batch|     |inter | batch   |
----|-----------------|------|------|-------|------|------|-----|------|----------                    
ABC |inter, batch     |101   |93    |       |302   |1327  |     |56    |131      |
----|-----------------|-------------|-------|------|------|-----|------|---------|
DEF |inter, batch     |24279 |421   |       |4935  |5452  |     |75    |300      |
----------------------------------------------------------------------------------

Saying that I want my dataFrame into MultiIndex Dataframe where Pend Job , Run Job, and Time to be on the top as above.

Edit:

cls is not in the columns


Solution

  • This is my approach, you can modify it to your need:

    s = (df.drop('measure', axis=1)                   # remove the measure column
           .set_index(df['measure'].apply(', '.join), 
                      append=True)                    # make `measure` second level index
           .stack().explode().to_frame()              # concatenate all the values
        )
    
    # assign `inter` and `batch` label to each new cell
    new_lvl = np.array(['inter','batch'])[s.groupby(level=(0,1,2)).cumcount()]
    # or
    # new_lvl = np.tile(['inter', 'batch'], len(s)//2)
    
    (s.set_index(new_level, append=True)[0]
      .unstack(level=(-2,-1)
      .reset_index()
    )
    

    Output:

       cls       measure Pend Job      
                            inter batch
    0  ABC  inter, batch      101    93
    1  DEF  inter, batch    24279   421