Search code examples
pythondataframepysparkapache-spark-sqlpyspark-pandas

Compare two couple of columns from two different pyspark dataframe to display the data that are different


i've got this dataframe with four columns

df1 = spark.createDataFrame([
    ('c', 'd', 3.0, 4),
    ('c', 'd', 7.3, 8),
    ('c', 'd', 7.3, 2),
    ('c', 'd', 7.3, 8),
    ('e', 'f', 6.0, 3),
    ('e', 'f', 6.0, 8),
    ('e', 'f', 6.0, 3),
    ('c', 'j', 4.2, 3),
    ('c', 'j', 4.3, 9),
], ['a', 'b', 'c', 'd'])

df1.show()
+---+---+---+---+
|  a|  b|  c|  d|
+---+---+---+---+
|  c|  d|3.0|  4|
|  c|  d|7.3|  8|
|  c|  d|7.3|  2|
|  c|  d|7.3|  8|
|  e|  f|6.0|  3|
|  e|  f|6.0|  8|
|  e|  f|6.0|  3|
|  c|  j|4.2|  3|
|  c|  j|4.3|  9|
+---+---+---+---+

and i also got this other dataframe df2 with the same schema as the dataframe df1


df2 = spark.createDataFrame([
    ('c', 'd', 3.0, 4),
    ('c', 'd', 3.3, 5),
    ('c', 'd', 7.3, 2),
    ('c', 'd', 7.3, 7),
    ('e', 'f', 6.0, 3),
    ('c', 'j', 4.2, 1),
    ('c', 'j', 4.3, 9),
], ['a', 'b', 'c', 'd'])
df2.show()
+---+---+---+---+
|  a|  b|  c|  d|
+---+---+---+---+
|  c|  d|3.0|  4|
|  c|  d|3.3|  5|
|  c|  d|7.3|  2|
|  c|  d|7.3|  7|
|  e|  f|6.0|  3|
|  c|  j|4.2|  1|
|  c|  j|4.3|  9|
+---+---+---+---+

I want to compare the couple (a, b, d) so that i can obtain the different values that are present in df2 but not in df1 like this

df3
+---+---+---+---+
|  a|  b|  c|  d|
+---+---+---+---+
|  c|  d|3.3|  5|
|  c|  d|7.3|  7|
|  c|  j|4.2|  1|
+---+---+---+---+


Solution

  • I think what you want is:

    df2.subtract(df1.intersect(df2)).show()
    

    I want what is in df2 that is not in both df1 and df2.

    +---+---+---+---+
    |  a|  b|  c|  d|
    +---+---+---+---+
    |  c|  j|4.2|  1|
    |  c|  d|3.3|  5|
    |  c|  d|7.3|  7|
    +---+---+---+---+
    

    I also agree with @pltc that call out you might have made a mistake in your output table.