Search code examples
pythonpandasxlsx

Trouble merging .xlsx files with pandas


I am operating with python 2.7 and I wrote a script that should take the name of two .xlsx files, use pandas to convert them in two dataframe and then concatenate them. The two files under consideration have the same rows and different columns. Basically, I have these two Excel files:

enter image description here enter image description here

I would like to keep the same rows and just unite the columns. The code is the following:

import pandas as pd

file1 = 'file1.xlsx'
file2 = 'file2.xlsx'
sheet10 = pd.read_excel(file1, sheet_name = 0)
sheet20 = pd.read_excel(file2, sheet_name = 0)

conc1 = pd.concat([sheet10, sheet20], sort = False)
output = pd.ExcelWriter('output.xlsx')
conc1.to_excel(output, 'Sheet 1')
output.save()

Instead of doing what I expected (given the examples I read online), the output becomes something like this:

enter image description here

Does anyone know I could I improve my script? Thank you very much.


Solution

  • The best answer here really depends on the exact shape of your data. Based on the example you have provided it looks like the data is indexed identically between the two dataframes with differing column headers that you want preserved. If this is the case this would be the best solution:

    import pandas as pd
    
    file1 = 'file1.xlsx'
    file2 = 'file2.xlsx'
    sheet10 = pd.read_excel(file1, sheet_name = 0)
    sheet20 = pd.read_excel(file2, sheet_name = 0)
    
    conc1 = sheet10.merge(sheet20, how="left", left_index=True, right_index=True)
    output = pd.ExcelWriter('output.xlsx')
    conc1.to_excel(output, sheet_name='Sheet 1', ignore_index=True)
    output.save()
    

    Since there is a direct match between the number of rows in the two initial dataframes it doesn't really matter if a left, right, outer, or inner join is used. In this example I used a left join.

    If the rows in the two data frames do not perfectly line up though, the join method selected can have a huge impact on your output. I recommend looking at pandas documentation on merge/join/concatenate before you go any further.