I'm using PyDeequ for data quality checks and I want to check the uniqueness of a set of columns. There is a Check method, hasUniqueness, but I can't figure out how to use it.
I'm trying:
check.hasUniqueness([col1, col2], ????)
But what should I use for the assertion function in place of ????
Has anyone tried the hasUniqueness check for a combination of columns?
hasUniqueness takes a function that accepts an int/float parameter and returns a boolean:
Creates a constraint that asserts on uniqueness in a single or combined set of key columns. Uniqueness is the fraction of rows whose value (or combination of values) in the given column(s) occurs exactly once.
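To make the shape of that assertion concrete, here is a minimal sketch in plain Python, with no Spark or PyDeequ involved. The names is_fully_unique and is_mostly_unique and the 0.95 threshold are hypothetical, chosen only for illustration; the one thing taken from the description above is that the function receives the computed uniqueness ratio and returns a boolean.

```python
# Hypothetical assertion functions (not part of PyDeequ). Each receives the
# uniqueness ratio, a number between 0.0 and 1.0, and returns True when the
# constraint should pass.

def is_fully_unique(ratio):
    # Passes only when every row's key combination occurs exactly once.
    return ratio == 1.0


def is_mostly_unique(ratio, threshold=0.95):
    # Passes when at least `threshold` of the rows carry a key that occurs once.
    # The 0.95 default is an arbitrary example value.
    return ratio >= threshold
```

Either function should be usable in place of ????, for example check.hasUniqueness([col1, col2], is_fully_unique), or the same logic can be written inline as a lambda.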
Here's a usage example:
df.show()
#+---+---+
#|  a|  b|
#+---+---+
#|foo|  1|
#|bar|  0|
#|baz|  1|
#|bar|  0|
#+---+---+
In this dataframe, the combination of columns a and b has 2 values that occur exactly once, (foo, 1) and (baz, 1), out of 4 rows, so Uniqueness = 2/4 = 0.5 here. Let's verify it using the check constraint:
from pydeequ.checks import CheckLevel, Check
from pydeequ.verification import VerificationResult, VerificationSuite
result = VerificationSuite(spark).onData(df).addCheck(
    Check(spark, CheckLevel.Warning, "test hasUniqueness")
    .hasUniqueness(["a", "b"], lambda x: x == 0.5)
).run()
result_df = VerificationResult.checkResultsAsDataFrame(spark, result)
result_df.select("constraint_status").show()
#+-----------------+
#|constraint_status|
#+-----------------+
#|          Success|
#+-----------------+
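As a sanity check on the 0.5 figure, the definition can be reproduced in plain Python without Spark. This is only a sketch of the metric as described above, not Deequ's implementation; the uniqueness helper is hypothetical, and returning 0.0 for an empty input is an assumption made for the sketch.

```python
from collections import Counter


def uniqueness(rows):
    """Fraction of rows whose value combination occurs exactly once."""
    if not rows:
        # Assumption for this sketch only: report 0.0 for an empty input.
        return 0.0
    counts = Counter(rows)
    # Count the distinct combinations that appear exactly one time.
    occurring_once = sum(1 for c in counts.values() if c == 1)
    return occurring_once / len(rows)


# Same data as the dataframe above, as (a, b) tuples.
rows = [("foo", 1), ("bar", 0), ("baz", 1), ("bar", 0)]
print(uniqueness(rows))  # 0.5
```

Only (foo, 1) and (baz, 1) appear once, so the result is 2 / 4 = 0.5, matching the value asserted in the check.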