I would like to convert an excel file from wide format into long format.
I'm reading an excel file which has not only two rows of headers, it also includes merged cells in the header. Input Example:
| Task | Name | May,2022 | Jun,2022 | Jul,2022 |
| | |status|type|status|type|status|type|
| ---- | ---- |------|----|------|----|------|----|
| 1 | Tom |1 |AB |1 |DT |0 |AB |
| 2 | John |0 |DT |1 |CC |0 |GH |
Desired Output:
Task | Name | Date | status | type |
---|---|---|---|---|
1 | Tom | May,2022 | 1 | AB |
1 | Tom | Jun,2022 | 1 | DT |
1 | Tom | Jul,2022 | 0 | AB |
2 | John | May,2022 | 0 | DT |
2 | John | Jun,2022 | 1 | CC |
2 | John | Jul,2022 | 0 | GH |
What i have:
import pandas as pd
df = pd.read_excel(path)
df = df.melt(id_vars=["Task","Name"],
var_name="Date",
value_name="Value")
I don't know how to deal with the multi header. I tried combining the two headers.
import pandas as pd
df = pd.read_excel(path, header[1,2])
df.columns = df.columns.to_series().ffill().values
df.columns.value[0]='Task'
df.columns.value[1]='Name'
df = df.melt(id_vars=["Task","Name"],
var_name="Date",
value_name="Value")
This results in having something like: (I'm fine with the date transformation)
Task | Name | (2022-05-01 00:00:00, status) | (2022-05-01 00:00:00, type) |
---|---|---|---|
1 | Tom | 1 | AB |
2 | John | 0 | DT |
The melt function returns a KeyError 'Task'. Even if this would work i would somehow need to split the date from the status and type into seperate columns. Someone knows how to deal with such data?
You can use set_index
to preserve Task
and Name
columns then stack
Date
column and finally reset_index
to get expected output:
>>> (df.set_index(['Task', 'Name']).rename_axis(columns=['Date', None])
.stack(level=0).reset_index())
Task Name Date status type
0 1 Tom Jul,2022 0 AB
1 1 Tom Jun,2022 1 DT
2 1 Tom May,2022 1 AB
3 2 John Jul,2022 0 GH
4 2 John Jun,2022 1 CC
5 2 John May,2022 0 DT
Minimal Reproducible Example:
data = {'index': [0, 1],
'columns': [('Task', ''),
('Name', ''),
('May,2022', 'status'),
('May,2022', 'type'),
('Jun,2022', 'status'),
('Jun,2022', 'type'),
('Jul,2022', 'status'),
('Jul,2022', 'type')],
'data': [[1, 'Tom', 1, 'AB', 1, 'DT', 0, 'AB'],
[2, 'John', 0, 'DT', 1, 'CC', 0, 'GH']],
'index_names': [None],
'column_names': [None, None]}
df = pd.DataFrame.from_dict(data, orient='tight')