Search code examples
pythonpandasmergexlsx

why am getting the different output when using the concatenation statement in python merge for an xlsx files with pandas


I am trying to read 2 xlsx files and trying to append the data to a single xls file. Sample input :

data in f1 :

id na ln 
a1 b1 c1
a2 b2 c2 
k1 k2 k3

data in f2 :

jd kd rd md nd pd  
a1 b1 c1 d1 e1 f1
a2 b2 c2 d2 e2 f2
a3 b3 c3 d3 e3 f3

The first row is the column names. I used the below code to merge based on the columns.

 import pandas as pd
 df1 = pd.read_excel('f1.xlsx',usecols=[0,1,2],header=None)
 df2 = pd.read_excel('f2.xlsx',usecols=[0,4,5],header=None)
 merge_df = pd.concat([df1 ,df2])

When I execute this , I am getting the 5 columns , where as I am expecting 3 columns. If I create a new file from f2 with 3 columns and use usecols=[0,1,2] , I am getting 3 columns. What's wrong in using different columns for the merge?

Hi, here is the print output from merge_df

    0    1    2    4    5
0  id   na   ln  NaN  NaN
1  a1   b1   c1  NaN  NaN
2  a2   b2   c2  NaN  NaN
3  k1   k2   k3  NaN  NaN
0  jd  NaN  NaN   nd   pd
1  a1  NaN  NaN   e1   f1
2  a2  NaN  NaN   e2   f2
3  a3  NaN  NaN   e3   f3
   df2 : 
      0   4   5
  0  jd  nd  pd
  1  a1  e1  f1
  2  a2  e2  f2
  3  a3  e3  f3

Solution

  • The columns get the numbers that you use for importing. I.e. df1 will have column "names" 0, 1 and 2. df2 wil have column names 0, 4 and 5.

    You want to rename the columns like so:

    df2 = df2.rename(columns={4: 1, 5: 2})
    

    This will give you the output

        0   1   2
    0  id  na  ln
    1  a1  b1  c1
    2  a2  b2  c2
    3  k1  k2  k3
    0  jd  nd  pd
    1  a1  e1  f1
    2  a2  e2  f2
    3  a3  e3  f3
    

    This is probably still not what you want, since id, na and ln are the headers, because you wrote

    The first row is the column names.

    You probably want something like

    df1 = pd.read_excel('f1.xlsx', usecols=[0, 1, 2], header=0)    # Use headers
    df2 = pd.read_excel('f2.xlsx', usecols=[0, 4, 5], header=0)    # Use headers
    df2 = df2.rename(columns={"jd": "id", "nd": "na", "pd":"ln"})  # Adapt headers to match