dataframe, scala, apache-spark, apache-spark-sql, comparison

Compare columns from two different dataframes based on id


I have two dataframes to compare. The records are in a different order, and the column names may differ. I need to compare several columns based on a unique key (id).

Example: consider dataframes df1 and df2

df1:

+---+-------+-----+
| id|student|marks|
+---+-------+-----+
|  1|  Vijay|   23|
|  4| Vithal|   24|
|  2|    Ram|   21|
|  3|  Rahul|   25|
+---+-------+-----+

df2:

+-----+--------+------+
|newId|student1|marks1|
+-----+--------+------+
|    3|   Rahul|    25|
|    2|     Ram|    23|
|    1|   Vijay|    23|
|    4|  Vithal|    24|
+-----+--------+------+

Here, based on id and newId, I need to compare the student and marks values and check whether the student with the same id has the same name and marks in both dataframes.

In this example, the student with id 2 has 21 marks in df1 but 23 marks in df2.


Solution

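exceptAll (available since Spark 2.4) returns the rows of df1 that have no identical row in df2. It matches columns by position, not by name, so the different column names are not a problem as long as both dataframes list their columns in the same order. Row order does not matter.

  • df1.exceptAll(df2).show()
    // +---+-------+-----+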
    // | id|student|marks|
    // +---+-------+-----+
    // |  2|    Ram|   21|
    // +---+-------+-----+
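
Note that exceptAll compares whole rows rather than matching on the key, so it reports the df1 row but not which column differs or what the df2 value is. If an explicit comparison on id is wanted, one possible approach is to join on the key and filter with a null-safe inequality. The following is only a sketch under assumptions: the object name CompareById, the helper mismatchesById, and the column-pair mapping are illustrative names that do not come from the question, and an inner join is assumed, so ids present in only one dataframe are ignored.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object CompareById {

  // Hypothetical helper: returns the rows whose key matches in both frames
  // but where at least one of the paired columns holds a different value.
  def mismatchesById(
      left: DataFrame,
      right: DataFrame,
      leftKey: String,
      rightKey: String,
      pairs: Seq[(String, String)]): DataFrame = {
    // Inner join on the key (assumption: ids missing on one side are skipped).
    val joined = left.join(right, left(leftKey) === right(rightKey), "inner")

    // <=> is null-safe equality; negate it and OR the results across all pairs.
    val anyDiff = pairs
      .map { case (l, r) => !(left(l) <=> right(r)) }
      .reduce(_ || _)

    // Show the key followed by each left/right column pair side by side.
    val shown = left(leftKey) +: pairs.flatMap { case (l, r) => Seq(left(l), right(r)) }
    joined.filter(anyDiff).select(shown: _*)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration; in spark-shell a session already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("compare-by-id")
      .getOrCreate()
    import spark.implicits._

    val df1 = Seq((1, "Vijay", 23), (4, "Vithal", 24), (2, "Ram", 21), (3, "Rahul", 25))
      .toDF("id", "student", "marks")
    val df2 = Seq((3, "Rahul", 25), (2, "Ram", 23), (1, "Vijay", 23), (4, "Vithal", 24))
      .toDF("newId", "student1", "marks1")

    // Assumed mapping of df1 column names to their df2 counterparts.
    val pairs = Seq("student" -> "student1", "marks" -> "marks1")

    // For the sample data only id 2 remains (marks 21 in df1, 23 in df2).
    mismatchesById(df1, df2, "id", "newId", pairs).show()

    spark.stop()
  }
}
```

Referring to columns through left(...) and right(...) rather than by bare name keeps the sketch usable even if both dataframes happen to share a column name. Switching the join type to "full_outer" would additionally surface ids that exist in only one dataframe; in that case the key shown should be coalesced from both sides.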