Search code examples
pythonpython-3.xpandasdataframemulti-index

Pandas: Collapse rows in a Multiindex dataframe


Below is my df:

df = pd.DataFrame({'A': [1, 1, 1, 2],
                         'B': [2, 2, 2, 3],
                         'C': [3, 3, 3, 4],
                         'D': ['Cancer A', 'Cancer B', 'Cancer A', 'Cancer B'],
                         'E': ['Ecog 9', 'Ecog 1', 'Ecog 0', 'Ecog 1'],
                         'F': ['val 6', 'val 1', 'val 0', 'val 1'],
                         'measure_m': [100, 200, 500, 300]})

print(df)

   A  B  C         D       E      F  measure_m
0  1  2  3  Cancer A  Ecog 9  val 6        100
1  1  2  3  Cancer B  Ecog 1  val 1        200
2  1  2  3  Cancer A  Ecog 0  val 0        500
3  2  3  4  Cancer B  Ecog 1  val 1        300

When I pivot this df without passing the index, I get this:

In [1280]: df.pivot(index=None, columns = ['A', 'B', 'C', 'D', 'E', 'F'])
Out[1280]: 
  measure_m                           
A         1                          2
B         2                          3
C         3                          4
D  Cancer A Cancer B Cancer A Cancer B
E    Ecog 9   Ecog 1   Ecog 0   Ecog 1
F     val 6    val 1    val 0    val 1
0     100.0      NaN      NaN      NaN
1       NaN    200.0      NaN      NaN
2       NaN      NaN    500.0      NaN
3       NaN      NaN      NaN    300.0

I want instead of 4 rows just 1 single row with all values of measure_m column, like below:

  measure_m                           
A         1                          2
B         2                          3
C         3                          4
D  Cancer A Cancer B Cancer A Cancer B
E    Ecog 9   Ecog 1   Ecog 0   Ecog 1
F     val 6    val 1    val 0    val 1
0     100.0    200.0    500.0    300.0

How to go about this?


Solution

  • Do you mean:

    df.set_index(list(df.columns[:-1])).T
    

    Output:

    A                1                          2
    B                2                          3
    C                3                          4
    D         Cancer A Cancer B Cancer A Cancer B
    E           Ecog 9   Ecog 1   Ecog 0   Ecog 1
    F            val 6    val 1    val 0    val 1
    measure_m      100      200      500      300
    

    Update a little modification to match your output:

    cols = ['A', 'B', 'C', 'D', 'E', 'F']
    
    (df.set_index(cols)
       [['measure_m']] # only need this if you have more columns
       .unstack(level=cols)
       .to_frame().T
    )
    

    Output:

      measure_m                           
    A         1                          2
    B         2                          3
    C         3                          4
    D  Cancer A Cancer B Cancer A Cancer B
    E    Ecog 9   Ecog 1   Ecog 0   Ecog 1
    F     val 6    val 1    val 0    val 1
    0       100      200      500      300