Search code examples
pythonrexceldataframeconsolidation

Merge data with varying headers in Python or R


I have an excel file with multiple sheets which need to be consolidated. However the column headers are varying from one another. Currently the data looks like this.

Sheet 1
+-------------+--------------+----------+--------+---------+---------+
| FISCAL_YEAR | COMPANY_CODE | ACCOUNTS | Header | Header1 | Header2 |
+-------------+--------------+----------+--------+---------+---------+
|          17 | Data         | Data     |      0 |       0 |       0 |
|          17 | Data         | Data     |      0 |       0 |       0 |
+-------------+--------------+----------+--------+---------+---------+

Sheet 2
+-------------+--------------+----------+---------+---------+
| FISCAL_YEAR | COMPANY_CODE | ACCOUNTS | Header3 | Header2 |
+-------------+--------------+----------+---------+---------+
|          15 | Data         | Data     |       0 |       0 |
|          15 | Data         | Data     |       0 |       0 |
+-------------+--------------+----------+---------+---------+

Sheet 3
+-------------+--------------+----------+---------+---------+---------+
| FISCAL_YEAR | COMPANY_CODE | ACCOUNTS | Header4 | Header1 | Header3 |
+-------------+--------------+----------+---------+---------+---------+
|          16 | Data         | Data     |       0 |       0 |       0 |
|          16 | Data         | Data     |       0 |       0 |       0 |
+-------------+--------------+----------+---------+---------+---------+

OUTPUT
+-------------+--------------+----------+--------+---------+---------+---------+---------+-----------+
| FISCAL_YEAR | COMPANY_CODE | ACCOUNTS | Header | Header1 | Header2 | Header3 | Header4 | SheetName |
+-------------+--------------+----------+--------+---------+---------+---------+---------+-----------+
|          17 | Data         | Data     | 0      | 0       | 0       | null    | null    | Sheet1    |
|          17 | Data         | Data     | 0      | 0       | 0       | null    | null    | Sheet1    |
|          15 | Data         | Data     | null   | null    | 0       | 0       | null    | Sheet2    |
|          15 | Data         | Data     | null   | null    | 0       | 0       | null    | Sheet2    |
|          16 | Data         | Data     | null   | 0       | null    | 0       | 0       | Sheet3    |
|          16 | Data         | Data     | null   | 0       | null    | 0       | 0       | Sheet3    |
+-------------+--------------+----------+--------+---------+---------+---------+---------+-----------+

I am relatively new to Python. I have used Pandas and numpy. I have as many as 60 sheets to work. Can anyone help me to understand how can I achieve this? If not python is there anyway other tool/method I should use? I could really use a code sample to start with.

Your help is really appreciated. Thank you in advance


Solution

  • import pandas as pd
    
    filepath = r"filePath here"
    sheets_dict = pd.read_excel(filepath, sheet_name=None)
    
    full_table = pd.DataFrame()
    
    #loop through sheets
    for name, sheet in sheets_dict.items():
        sheet['sheet'] = name
        #sheet = sheet.rename(columns=lambda x: x.split('\n')[-1])
        full_table = full_table.append (sheet)
    
    full_table.reset_index (inplace=True, drop=True)
    
    #Write to Excel
    writer = pd.ExcelWriter('consolidated_TB1.xlsx', engine='xlsxwriter')
    full_table.to_excel(writer,'Sheet1')
    
    # Close the Pandas Excel writer and output the Excel file.
    writer.save()