Search code examples
pythonexcelpandasdataframeconcatenation

How can i merge 2 dataframes and their deduction(in case the type is not object)


I have 2 dataframes having the same columns and row numbers, some of these data frames columns are Dates(1 per dataframe), others floats and objects where strings are stored.

Let's take the below as a simplified example for df1(my actual dataframe has hundreds of columns):

         Date               Val0        Val1       
0   2020-09-29 13:22:57      char1       5.34      
1   2020-09-29 13:23:12      char2       4.5       
2   2020-09-29 13:23:44      char3       Nan      
3   2020-09-29 13:24:01      char4      24       

val0 in that example is an object

df2 looks like:

  Date                    Val0        Val1 

     
0   2020-09-29 13:22:58      char6       Nan        
1   2020-09-29 13:23:12      char4       89.77      
2   2020-09-29 13:23:45                  Nan      
3   2020-09-29 13:24:01      char4      24  

  

Expected

      Date                  Date                     Val0     Val0      Val1   Val1  Val1       
0   2020-09-29 13:22:57      2020-09-29 13:22:58    char1   char6       5.34   Nan    5.34         
1   2020-09-29 13:23:12      2020-09-29 13:23:12    char2   char4       4.5   89.77  -85.27             
2   2020-09-29 13:23:44      2020-09-29 13:23:45    char3                Nan    Nan   Nan         
3   2020-09-29 13:24:01      2020-09-29 13:24:01    char4    char4       24     24     0        

As we can see the floats got deducted and we had a new column having the difference while the objects having the same column are now 'next to one another'

I do realize that having the same column names might not be the approach, example what if we want to drop one or the .2 added by default by pandas, for me the column names at this stage don't matter they can be whatever as they're aligned as demonstrated and would still help me in my next step when writing to excel which is merging the row names only, example we'd have one cell with 2 columns under it while it would also be 1 row name of val1 and 3 columns under it.

My code and approach:

I wanted to get rid of the stuff i can't deduct, in this case val0, so i dropped that column from df1 and df2 in copied dataframes then deducted the outcome dfs giving me the last column for val1 and the deduction of dates just in case i wanted it. then concatenated df1 and df_2 looking like the below correspondant to df_6:

df_minus = df_2 - df
    df5 = pd.concat([df_2, df], axis=1)
    df_6 = df5[list(sum(zip(df_2.columns, df.columns), ()))]
    print(df_6)



    Date                  Date                     Val0     Val0      Val1   Val1    
0   2020-09-29 13:22:57      2020-09-29 13:22:58    char1   char6       5.34   Nan          
1   2020-09-29 13:23:12      2020-09-29 13:23:12    char2   char4       4.5   89.77              
2   2020-09-29 13:23:44      2020-09-29 13:23:45    char3                Nan    Nan           
3   2020-09-29 13:24:01      2020-09-29 13:24:01    char4    char4       24     24 

My approach did solve the columns being objects issue, but then for the floats i though about using df_minus somehow but couldn't find a way.Some might believe i should simply take that column from df_minus and add it in df_6 but let's keep in mind that my actual data has hundreds of column so that's not an option.


Solution

  • Here's an approach that should get you what I think you're trying to do.

    We identify numeric columns for which the subtraction operation makes sense, then do the subtraction and use the results to update df_minus, a copy of one of the input dataframes. We then rename the columns in df2 by appending a suffix '.2' and columns in df_minus by appending either '.diff' for our subtraction results or '.delete' (marked for future deletion) for all other columns.

    We then use your column sorting logic, but with 3 dataframes instead of just 2, so we have df1, df2 and df_minus in the results. Finally, we delete the unwanted columns that came from df_minus.

    import pandas as pd
    import numpy as np
    df1 = pd.DataFrame([
    
             {'Date':'2020-09-29 13:22:57','Val0':'char1','Val1':5.34},
             {'Date':'2020-09-29 13:23:12','Val0':'char2','Val1':4.5},
             {'Date':'2020-09-29 13:23:44','Val0':'char3','Val1':np.NaN},
             {'Date':'2020-09-29 13:24:01','Val0':'char4','Val1':24}
    ])
    df2 = pd.DataFrame([
    
             {'Date':'2020-09-29 13:22:58','Val0':'char6','Val1':np.NaN},
             {'Date':'2020-09-29 13:23:12','Val0':'char4','Val1':89.77},
             {'Date':'2020-09-29 13:23:45','Val0':'',     'Val1':np.NaN},
             {'Date':'2020-09-29 13:24:01','Val0':'char4','Val1':24}
    ])
    print(f"df1:\n{df1}\n")
    print(f"df2:\n{df2}\n")
    
    dfNumericCols = set(c for c in df1.columns if isinstance(df1.loc[0,c], (float,int)))
    df_minus = df1.copy(deep=True)
    df_minus[list(dfNumericCols)] = df1[list(dfNumericCols)] - df2[list(dfNumericCols)]
    df_minus.rename(columns={c:c+'.diff' if c in dfNumericCols else c+'.delete' for c in df_minus.columns}, inplace=True)
    print(f"df_minus:\n{df_minus}\n")
    
    df2.rename(columns={c:c+'.2' for c in df2.columns}, inplace=True)
    df5 = pd.concat([df1, df2, df_minus], axis=1)
    df6 = df5[list(sum(zip(df1.columns, df2.columns, df_minus.columns), ()))]
    df6 = df6.drop(columns=[c for c in df6 if c[-len('.delete'):] == '.delete'], axis=1)
    print(f"df6:\n{df6}\n")
    

    Output:

    df1:
                      Date   Val0   Val1
    0  2020-09-29 13:22:57  char1   5.34
    1  2020-09-29 13:23:12  char2   4.50
    2  2020-09-29 13:23:44  char3    NaN
    3  2020-09-29 13:24:01  char4  24.00
    
    df2:
                      Date   Val0   Val1
    0  2020-09-29 13:22:58  char6    NaN
    1  2020-09-29 13:23:12  char4  89.77
    2  2020-09-29 13:23:45           NaN
    3  2020-09-29 13:24:01  char4  24.00
    
    df_minus:
               Date.delete Val0.delete  Val1.diff
    0  2020-09-29 13:22:57       char1        NaN
    1  2020-09-29 13:23:12       char2     -85.27
    2  2020-09-29 13:23:44       char3        NaN
    3  2020-09-29 13:24:01       char4       0.00
    
    df6:
                      Date               Date.2   Val0 Val0.2   Val1  Val1.2  Val1.diff
    0  2020-09-29 13:22:57  2020-09-29 13:22:58  char1  char6   5.34     NaN        NaN
    1  2020-09-29 13:23:12  2020-09-29 13:23:12  char2  char4   4.50   89.77     -85.27
    2  2020-09-29 13:23:44  2020-09-29 13:23:45  char3           NaN     NaN        NaN
    3  2020-09-29 13:24:01  2020-09-29 13:24:01  char4  char4  24.00   24.00       0.00