Search code examples
pythonpandasexceldataframemulti-index

Pandas dataframe with wrong level merging for multiIndex columns


I have the following spreadsheet with two-row header as part of workbook.

enter image description here

Name Age Genre Monthly Annually Department        Region
P1   28  Female 1000   12000    I.T.              North
P2   25  Male   1100   13200    I.T.              South
P3   29  Female 1500   18000    Human Ressources  North

The goal is getting a data frame containing all data and one-level columns, where we get "./Genre", "Salary/Monthly", "Salary/Anually", "./Department", and so on. Even cells D1 and E1 being merged, Pandas looks at E1 such as C1 or F1 (empty content).
Based on this post, I solved partially the issue adopting the following code:

import pandas as pd

minimal_df = test_df = pd.read_excel(io="multiIndex_columns.xlsx", sheet_name="sample", header=None, skiprows=2, nrows=3) 

minimal_index = pd.read_excel(io="multiIndex_columns.xlsx", sheet_name="sample", header=None, nrows=2)

If we see how minimal_index is presented, we find "Salary" at column 3 but not at 4.

      0    1      2        3        4           5       6
0   NaN  NaN    NaN   Salary      NaN         NaN     NaN
1  Name  Age  Genre  Monthly  Anually  Department  Region

Anyway, it's done assignment of column names to the data frame after NA values from auxiliary data frame have been filled.

minimal_index.fillna(method='ffill', axis=1, inplace=True)

minimal_df.columns = pd.MultiIndex.from_arrays(minimal_index.values)

Same data values color-highlighted but one-level columns with first three names associated to none and last four names to Salary

nan/Name nan/Age nan/Genre Salary/Monthly Salary/Annually Salary/Department   Salary/Region
P1       28      Female    1000           12000           I.T.                North
P2       25      Male      1100           13200           I.T.                South
P3       29      Female    1500           18000           Human Ressources    North

Above, we check through PyCharm Community Edition 2023.1.4 the final result. However, columns Department and Region turn related to Salary also due to "ffill" method used by fillna(). On the other hand, I can assign nothing but keeping NaN values on first three columns because fillna() can't allow "value" and "method" arguments simultaneously.

Is there a way to solve both parts, avoiding file modifications by Excel? What solution, based on Pandas or another library, could recognize merging on input cells?


Solution

  • I don't think this is possible with pandas. I would use to handle the merged cells :

    import openpyxl
    
    ws = openpyxl.load_workbook("file.xlsx")["sample"]
    
    TR, BR = 1, 2 # <-- top and bottom r-indices holding the header
    SR = 3 # <-- the row indice where the table starts (excluding the header)
    
    def get_val(x, sheet=ws):
        if isinstance(x, openpyxl.cell.cell.MergedCell):
            for r in sheet.merged_cells.ranges:
                if r.__contains__(x.coordinate):
                    return r.start_cell.value 
        else:
            return x.value
    
    header = [
        f"{get_val(c1)}/{get_val(c2)}" if get_val(c1) else f"./{c2.value}"
        for (c1,c2) in zip(ws[TR], ws[BR])
    ]
    
    data = [row for row in ws.iter_rows(min_row=SR, values_only=True)]
    
    df = pd.DataFrame(data, columns=header)
    

    Output (df) :

    Info/Name Info/Age Info/Genre ./Id Salary/Monthly Salary/Anually ./Department ./Region
    P1 28 Female id001 1000 12000 I.T. North
    P2 25 Male id002 1100 13200 I.T. South
    P3 29 Female id003 1500 18000 Human Resources North

    Used input (file.xlsx) :

    enter image description here