Search code examples
pythonpandasdataframedata-science

Converting multi column header df from wide to long format in python


I would like to convert an excel file from wide format into long format.

I'm reading an excel file which has not only two rows of headers, it also includes merged cells in the header. Input Example:

| Task | Name | May,2022  | Jun,2022  | Jul,2022  |
|      |      |status|type|status|type|status|type|
| ---- | ---- |------|----|------|----|------|----|
| 1    | Tom  |1     |AB  |1     |DT  |0     |AB  | 
| 2    | John |0     |DT  |1     |CC  |0     |GH  |

Desired Output:

Task Name Date status type
1 Tom May,2022 1 AB
1 Tom Jun,2022 1 DT
1 Tom Jul,2022 0 AB
2 John May,2022 0 DT
2 John Jun,2022 1 CC
2 John Jul,2022 0 GH

What i have:

import pandas as pd
df = pd.read_excel(path)
df = df.melt(id_vars=["Task","Name"],
     var_name="Date",
     value_name="Value")

I don't know how to deal with the multi header. I tried combining the two headers.

import pandas as pd
df = pd.read_excel(path, header[1,2])
df.columns = df.columns.to_series().ffill().values
df.columns.value[0]='Task'
df.columns.value[1]='Name'
df = df.melt(id_vars=["Task","Name"],
     var_name="Date",
     value_name="Value")

This results in having something like: (I'm fine with the date transformation)

Task Name (2022-05-01 00:00:00, status) (2022-05-01 00:00:00, type)
1 Tom 1 AB
2 John 0 DT

The melt function returns a KeyError 'Task'. Even if this would work i would somehow need to split the date from the status and type into seperate columns. Someone knows how to deal with such data?


Solution

  • You can use set_index to preserve Task and Name columns then stack Date column and finally reset_index to get expected output:

    >>> (df.set_index(['Task', 'Name']).rename_axis(columns=['Date', None])
           .stack(level=0).reset_index())
    
       Task  Name      Date  status type
    0     1   Tom  Jul,2022       0   AB
    1     1   Tom  Jun,2022       1   DT
    2     1   Tom  May,2022       1   AB
    3     2  John  Jul,2022       0   GH
    4     2  John  Jun,2022       1   CC
    5     2  John  May,2022       0   DT
    

    Minimal Reproducible Example:

    data = {'index': [0, 1],
            'columns': [('Task', ''),
             ('Name', ''),
             ('May,2022', 'status'),
             ('May,2022', 'type'),
             ('Jun,2022', 'status'),
             ('Jun,2022', 'type'),
             ('Jul,2022', 'status'),
             ('Jul,2022', 'type')],
            'data': [[1, 'Tom', 1, 'AB', 1, 'DT', 0, 'AB'],
                     [2, 'John', 0, 'DT', 1, 'CC', 0, 'GH']],
            'index_names': [None],
            'column_names': [None, None]}
    df = pd.DataFrame.from_dict(data, orient='tight')