python pandas excel dataframe multi-index

Pandas dataframe with wrong level merging for multiIndex columns

I have the following spreadsheet with two-row header as part of workbook.

Name Age Genre Monthly Annually Department        Region
P1   28  Female 1000   12000    I.T.              North
P2   25  Male   1100   13200    I.T.              South
P3   29  Female 1500   18000    Human Ressources  North

The goal is getting a data frame containing all data and one-level columns, where we get "./Genre", "Salary/Monthly", "Salary/Anually", "./Department", and so on. Even cells D1 and E1 being merged, Pandas looks at E1 such as C1 or F1 (empty content).
Based on this post, I solved partially the issue adopting the following code:

import pandas as pd

minimal_df = test_df = pd.read_excel(io="multiIndex_columns.xlsx", sheet_name="sample", header=None, skiprows=2, nrows=3) 

minimal_index = pd.read_excel(io="multiIndex_columns.xlsx", sheet_name="sample", header=None, nrows=2)

If we see how minimal_index is presented, we find "Salary" at column 3 but not at 4.

      0    1      2        3        4           5       6
0   NaN  NaN    NaN   Salary      NaN         NaN     NaN
1  Name  Age  Genre  Monthly  Anually  Department  Region

Anyway, it's done assignment of column names to the data frame after NA values from auxiliary data frame have been filled.

minimal_index.fillna(method='ffill', axis=1, inplace=True)

minimal_df.columns = pd.MultiIndex.from_arrays(minimal_index.values)

nan/Name nan/Age nan/Genre Salary/Monthly Salary/Annually Salary/Department   Salary/Region
P1       28      Female    1000           12000           I.T.                North
P2       25      Male      1100           13200           I.T.                South
P3       29      Female    1500           18000           Human Ressources    North

Above, we check through PyCharm Community Edition 2023.1.4 the final result. However, columns Department and Region turn related to Salary also due to "ffill" method used by fillna(). On the other hand, I can assign nothing but keeping NaN values on first three columns because fillna() can't allow "value" and "method" arguments simultaneously.

Is there a way to solve both parts, avoiding file modifications by Excel? What solution, based on Pandas or another library, could recognize merging on input cells?

Solution

I don't think this is possible with pandas. I would use openpyxl to handle the merged cells :

import openpyxl

ws = openpyxl.load_workbook("file.xlsx")["sample"]

TR, BR = 1, 2 # <-- top and bottom r-indices holding the header
SR = 3 # <-- the row indice where the table starts (excluding the header)

def get_val(x, sheet=ws):
    if isinstance(x, openpyxl.cell.cell.MergedCell):
        for r in sheet.merged_cells.ranges:
            if r.__contains__(x.coordinate):
                return r.start_cell.value 
    else:
        return x.value

header = [
    f"{get_val(c1)}/{get_val(c2)}" if get_val(c1) else f"./{c2.value}"
    for (c1,c2) in zip(ws[TR], ws[BR])
]

data = [row for row in ws.iter_rows(min_row=SR, values_only=True)]

df = pd.DataFrame(data, columns=header)

Output (df) :

Info/Name	Info/Age	Info/Genre	./Id	Salary/Monthly	Salary/Anually	./Department	./Region
P1	28	Female	id001	1000	12000	I.T.	North
P2	25	Male	id002	1100	13200	I.T.	South
P3	29	Female	id003	1500	18000	Human Resources	North

Used input (file.xlsx) :