I have a excel file with multi-columns as below (Sorry but I don't know how to recreate it with pandas):
Below is my expected Output:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Code': ['11000000000', '11200100000', '11710000000', '11000000000', '11200100000', '11710000000', '11000000000', '11200100000', '11710000000'],
'Code Name': ['Car', 'Motorbike', 'Bike', 'Car', 'Motorbike', 'Bike', 'Car', 'Motorbike', 'Bike'],
'Date': ['19-02-2024', '19-02-2024', '19-02-2024', '19-02-2024', '19-02-2024', '19-02-2024', '19-02-2024', '19-02-2024', '19-02-2024'],
'Customer': ['Customer A', 'Customer A', 'Customer A', 'Customer B', 'Customer B', 'Customer B', 'Customer ...', 'Customer ...', 'Customer ...'],
'Point_1': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
'Point_2': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]})
df
Code Code Name Date Customer Point_1 Point_2
0 11000000000 Car 19-02-2024 Customer A NaN NaN
1 11200100000 Motorbike 19-02-2024 Customer A NaN NaN
2 11710000000 Bike 19-02-2024 Customer A NaN NaN
3 11000000000 Car 19-02-2024 Customer B NaN NaN
4 11200100000 Motorbike 19-02-2024 Customer B NaN NaN
5 11710000000 Bike 19-02-2024 Customer B NaN NaN
6 11000000000 Car 19-02-2024 Customer ... NaN NaN
7 11200100000 Motorbike 19-02-2024 Customer ... NaN NaN
8 11710000000 Bike 19-02-2024 Customer ... NaN NaN
What should I do to get this result. Thank you
Create DataFrame with MultiIndex
first in index and columns by parameters index_col
and header
and then use DataFrame.stack
with first level and dropna
parameter for avoid remove rows with missing values:
df = pd.read_excel(file, index_col=[0,1,2], header=[0,1])
#test MultiIndex in columns
print (df.columns)
MultiIndex([('Customer A', 'Point_1'),
('Customer A', 'Point_2'),
('Customer B', 'Point_1'),
('Customer B', 'Point_2'),
('Customer ...', 'Point_1'),
('Customer ...', 'Point_2')],
names=['Customer', None])
#test MultiIndex in index
print (df.index)
MultiIndex([('11000000000', 'Car', '19-02-2024'),
('11200100000', 'Motorbike', '19-02-2024'),
('11710000000', 'Bike', '19-02-2024')],
names=['Code', 'Code Name', 'Date'])
EDIT: There is problem with missing values in headers, so is possible use alternative solution - create MultiIndex first in columns and use DataFrame.set_index
with DataFrame.rename_axis
:
df = pd.read_excel('file.xls', header=[0,1])
df = df.set_index(df.columns[:3].tolist()).rename_axis(df.columns[:3].get_level_values(0))
print (df)
Customer A Customer B \
Point_1 Point_2 Point_1 Point_2
Code Code Name Date
11000000000 Car 19-02-2024 NaN NaN NaN NaN
11200100000 Motorbike 19-02-2024 NaN NaN NaN NaN
11710000000 Bike 19-02-2024 NaN NaN NaN NaN
Customer …
Point_1 Point_2
Code Code Name Date
11000000000 Car 19-02-2024 NaN NaN
11200100000 Motorbike 19-02-2024 NaN NaN
11710000000 Bike 19-02-2024 NaN NaN
out = df.stack(0, dropna=False).reset_index()
print (out)
Code Code Name Date Customer Point_1 Point_2
0 11000000000 Car 19-02-2024 Customer ... NaN NaN
1 11000000000 Car 19-02-2024 Customer A NaN NaN
2 11000000000 Car 19-02-2024 Customer B NaN NaN
3 11200100000 Motorbike 19-02-2024 Customer ... NaN NaN
4 11200100000 Motorbike 19-02-2024 Customer A NaN NaN
5 11200100000 Motorbike 19-02-2024 Customer B NaN NaN
6 11710000000 Bike 19-02-2024 Customer ... NaN NaN
7 11710000000 Bike 19-02-2024 Customer A NaN NaN
8 11710000000 Bike 19-02-2024 Customer B NaN NaN