Search code examples
pythonpandasdataframeconcatenation

concat or merge to data frame tables that do not have primary keys


I have 2 data frames

    Name    Middle Name Last Name
0   Ambar   Klein   Liam
1   Maya    J   Jacobson
2   Steve   Knowles Gomez
3   Sam l   Smith
4   Gomez   John    Smith
5   Mical   M   Mason     


Name    Last Name
0   Smith   Klein
1   Sams    Peter
2   Betts   Knowles

I am trying to concat these 2 data frames that relatively have the same data but dont necessarily have any joining fields. My goal is to have all the data in one DF rather than having to work from multiple dfs.

I currently have the following code, but when i concat all the columns are not concat into one table, but rather some column are missing and show up in the row.

test_li = []

 

for test_filename in test:

test_df = pd.read_excel(test_filename, index_col=None, header=None)

test_li.append(test_df)



test_frame = (

            pd.concat(test_li , axis=0, ignore_index=True)

                .dropna(how="all") #to get rid of the eventual extra rows above each header

                .drop_duplicates() #to get rid of the cumulated duplicated headers

                .T.set_index(0).T #to make the first row as header of the dataframe

                # .drop(list(df.filter(regex=string)),axis=1,inplace=True)

        )

test_frame = test_frame.loc[:,test_frame.columns.notnull()]

test_frame

the above code produces the current output: However in row 13 you can see that one of my tables had Only the first and last name as columns whereas the second table had first, middle and last name. And during the concat process, the headers from one of the tables is converted into row 13.

Current output

Name    Middle Name Last Name
1   Ambar   Klein   Liam
2   Maya    J   Jacobson
3   Steve   Knowles Gomez
4   Sam l   Smith
5   Gomez   John    Smith
6   Mical   M   Mason
12  NaN NaN t
13  Name    Last Name   NaN
14  Smith   Klein   NaN
15  Sams    Peter   NaN
16  Betts   Knowles NaN

I would like to be able to concat my dfs and have the output in the format below, and to where all the columns from both dfs are merged together.

Desired Output

Name      Middle Name   Last Name
0   Ambar   Klein      Liam
1   Maya    J         Jacobson
2   Steve   Knowles     Gomez
3   Sam l   Smith
4   Gomez   John    Smith
5   Mical   M       Mason
6   Smith   -        Klein
7   Sams    -      Peter
8   Betts   -      Knowles

Solution

  • You should import your data with a correct header. Use the default header=0 parameter of read_excel, not header=None:

    test_li = []
    for test_filename in test:
        test_li.append(pd.read_excel(test_filename, index_col=None))
    

    Then a simple concat will work directly:

    test_frame = (pd.concat(test_li, ignore_index=True)
                    .dropna(how='all') # optional
                    .drop_duplicates() #
                  )