Search code examples
pythonpandasdataframeassert

How to compare 2 non-identical dataframes in python


I have two dataframes with the same column order but different column names and different rows. df2 rows vary from df1 rows.

df1=     col_id  num  name
      0   1     3     linda
      1   2     4     James

df2=     id     no   name
      0   1     2    granpa
      1   2     6    linda
      2   3     7    sam

This is the output I need. Outputs rows with same, OLD and NEW values so the user can clearly see what changed between two dataframes:

 result   col_id        num              name   
      0   1             was 3| now 2    was linda| now granpa  
      1   2             was 4| now 6    was James| now linda
      2   was  | now 3  was  | now 7    was      | now sam

Solution

  • Since your goal is just to compare differences, use DataFrame.compare instead of aggregating into strings.

    However,

    DataFrame.compare can only compare identically-labeled (i.e. same shape, identical row and column labels) DataFrames

    So we just need to align the row/column indexes, either via merge or reindex.


    Align via merge

    1. Outer-merge the two dfs:

      merged = df1.merge(df2, how='outer', left_on='col_id', right_on='id')
      #    col_id  num  name_x  id  no  name_y
      # 0       1    3   linda   1   2  granpa
      # 1       2    4   james   2   6   linda
      # 2     NaN  NaN     NaN   3   7     sam
      
    2. Divide the merged frame into left/right frames and align their columns with set_axis:

      cols = df1.columns
      left = merged.iloc[:, :len(cols)].set_axis(cols, axis=1)
      #    col_id  num    name
      # 0       1    3   linda
      # 1       2    4   james
      # 2     NaN  NaN     NaN
      
      right = merged.iloc[:, len(cols):].set_axis(cols, axis=1)
      #    col_id  num    name
      # 0       1    2  granpa
      # 1       2    6   linda
      # 2       3    7     sam
      
    3. compare the aligned left/right frames (use keep_equal=True to show equal cells):

      left.compare(right, keep_shape=True, keep_equal=True)
      #        col_id         num          name
      #    self other  self other   self  other
      # 0     1     1     3     2  linda granpa
      # 1     2     2     4     6  james  linda
      # 2   NaN     3   NaN     7    NaN    sam
      
      left.compare(right, keep_shape=True)
      #        col_id         num          name
      #    self other  self other   self  other
      # 0   NaN   NaN     3     2  linda granpa
      # 1   NaN   NaN     4     6  james  linda
      # 2   NaN     3   NaN     7    NaN    sam
      

    Align via reindex

    If you are 100% sure that one df is a subset of the other, then reindex the subsetted rows.

    In your example, df1 is a subset of df2, so reindex df1:

    df1.assign(id=df1.col_id)          # copy col_id (we need original col_id after reindexing)
       .set_index('id')                # set index to copied id
       .reindex(df2.id)                # reindex against df2's id
       .reset_index(drop=True)         # remove copied id
       .set_axis(df2.columns, axis=1)  # align column names
       .compare(df2, keep_equal=True, keep_shape=True)
    
    #        col_id         num          name
    #    self other  self other   self  other
    # 0     1     1     3     2  linda granpa
    # 1     2     2     4     6  james  linda
    # 2   NaN     3   NaN     7    NaN    sam
    

    Nullable integers

    Normally int cannot mix with nan, so pandas converts to float. To keep the int values as int (like the examples above):