Search code examples
dataframepysparkapache-spark-sqlpyspark-pandaspyspark-schema

compare two dataframes and display the data that are different


i have two dataframes and i want to compare the values of two columns and display those who are different, for exemple: compare this Table 1

A B C D
O1 2 E1 2
O1 3 E1 1
O1 2 E1 0
O1 5 E2 2
O1 2 E2 3
O1 2 E2 2
O1 5 E2 1
O2 8 E1 2
O2 8 E1 0
O2 0 E1 1
O2 2 E1 4
O2 9 E1 2
O2 2 E2 1
O2 9 E2 4
O2 2 E2 2

with this table 2

A B C D
O1 2 E1 2
O1 2 E2 3
O2 2 E1 4
O2 9 E2 4

i tried

cond= [table1.A ==  table2.A, table1.C ==  table2.C, table1.D ==  table2.D]
join = table1.join(table2,cond,"leftsemi")

and since i have a lot of data in it, i don't know how to check if the result i've got is correct


Solution

  • Since your dataframes has the same schema, you can use subtract

    df1
    df1 = spark.createDataFrame([
        (1, 2, 3, 4),
        (5, 6, 7, 8),
    ], ['a', 'b', 'c', 'd'])
    
    +---+---+---+---+
    |  a|  b|  c|  d|
    +---+---+---+---+
    |  1|  2|  3|  4|
    |  5|  6|  7|  8|
    +---+---+---+---+
    
    df2
    df2 = spark.createDataFrame([
        (5, 6, 7, 8),
    ], ['a', 'b', 'c', 'd'])
    
    +---+---+---+---+
    |  a|  b|  c|  d|
    +---+---+---+---+
    |  5|  6|  7|  8|
    +---+---+---+---+
    
    subtract to get data that exists in df1 but does not exists in df2
    df1.subtract(df2).show()
    
    +---+---+---+---+
    |  a|  b|  c|  d|
    +---+---+---+---+
    |  1|  2|  3|  4|
    +---+---+---+---+