I have the following spreadsheet with two-row header as part of workbook.
Name Age Genre Monthly Annually Department Region
P1 28 Female 1000 12000 I.T. North
P2 25 Male 1100 13200 I.T. South
P3 29 Female 1500 18000 Human Ressources North
The goal is getting a data frame containing all data and one-level columns, where we get "./Genre", "Salary/Monthly", "Salary/Anually", "./Department", and so on. Even cells D1 and E1 being merged, Pandas looks at E1 such as C1 or F1 (empty content).
Based on this post, I solved partially the issue adopting the following code:
import pandas as pd
minimal_df = test_df = pd.read_excel(io="multiIndex_columns.xlsx", sheet_name="sample", header=None, skiprows=2, nrows=3)
minimal_index = pd.read_excel(io="multiIndex_columns.xlsx", sheet_name="sample", header=None, nrows=2)
If we see how minimal_index
is presented, we find "Salary" at column 3 but not at 4.
0 1 2 3 4 5 6
0 NaN NaN NaN Salary NaN NaN NaN
1 Name Age Genre Monthly Anually Department Region
Anyway, it's done assignment of column names to the data frame after NA values from auxiliary data frame have been filled.
minimal_index.fillna(method='ffill', axis=1, inplace=True)
minimal_df.columns = pd.MultiIndex.from_arrays(minimal_index.values)
nan/Name nan/Age nan/Genre Salary/Monthly Salary/Annually Salary/Department Salary/Region
P1 28 Female 1000 12000 I.T. North
P2 25 Male 1100 13200 I.T. South
P3 29 Female 1500 18000 Human Ressources North
Above, we check through PyCharm Community Edition 2023.1.4 the final result. However, columns Department and Region turn related to Salary also due to "ffill" method used by fillna()
. On the other hand, I can assign nothing but keeping NaN values on first three columns because fillna()
can't allow "value" and "method" arguments simultaneously.
Is there a way to solve both parts, avoiding file modifications by Excel? What solution, based on Pandas or another library, could recognize merging on input cells?
I don't think this is possible with pandas. I would use openpyxl to handle the merged cells :
import openpyxl
ws = openpyxl.load_workbook("file.xlsx")["sample"]
TR, BR = 1, 2 # <-- top and bottom r-indices holding the header
SR = 3 # <-- the row indice where the table starts (excluding the header)
def get_val(x, sheet=ws):
if isinstance(x, openpyxl.cell.cell.MergedCell):
for r in sheet.merged_cells.ranges:
if r.__contains__(x.coordinate):
return r.start_cell.value
else:
return x.value
header = [
f"{get_val(c1)}/{get_val(c2)}" if get_val(c1) else f"./{c2.value}"
for (c1,c2) in zip(ws[TR], ws[BR])
]
data = [row for row in ws.iter_rows(min_row=SR, values_only=True)]
df = pd.DataFrame(data, columns=header)
Output (df) :
Info/Name | Info/Age | Info/Genre | ./Id | Salary/Monthly | Salary/Anually | ./Department | ./Region |
---|---|---|---|---|---|---|---|
P1 | 28 | Female | id001 | 1000 | 12000 | I.T. | North |
P2 | 25 | Male | id002 | 1100 | 13200 | I.T. | South |
P3 | 29 | Female | id003 | 1500 | 18000 | Human Resources | North |
Used input (file.xlsx) :