I need to check my solution for idempotency and check how much it's different with past solution.
I tried next:
spark.sql('''
select * from t1
except
select * from t2
''').count()
It's gives me information how much this tables different (t1 - my solution, t2 - primal data). If here is many different data, I want to check, where it different.
So, I tried that:
diff = {}
columns = t1.columns
for col in columns:
cntr = spark.sql('''
select {col} from t1
except
select {col} from t2
''').count()
diff[col] = cntr
print(diff)
It's not good for me, because it's works about 1-2 hours (both tables have 30 columns and 30 million lines of data).
Do you guys have an idea how to calculate this quickly?
Except is a kind of a join on all columns at the same time. Does your data have a primary key? It could even be complex, comprising of multiple columns, but it's still much better then taking all 30 columns into account.
Once you figure out the primary key you can do the FULL OUTER JOIN
and:
NULL
s on the leftNULL
s on the right