
How to use hasUniqueness check in PyDeequ?


I'm using PyDeequ for data quality checks and I want to check the uniqueness of a set of columns. There is a Check method hasUniqueness, but I can't figure out how to use it.

I'm trying:

check.hasUniqueness([col1, col2], ????) 

But what should we use for the assertion function in place of ????

Has anyone tried the check hasUniqueness for a combination of columns?


Solution

  • hasUniqueness takes a function that accepts an int/float parameter and returns a boolean:

    Creates a constraint that asserts any uniqueness in a single or combined set of key columns. Uniqueness is the fraction of a column's (or column combination's) values that occur exactly once.

    Here's an example of usage:

    # sample dataframe matching the output below
    df = spark.createDataFrame(
        [("foo", 1), ("bar", 0), ("baz", 1), ("bar", 0)],
        ["a", "b"],
    )
    df.show()
    #+---+---+
    #|  a|  b|
    #+---+---+
    #|foo|  1|
    #|bar|  0|
    #|baz|  1|
    #|bar|  0|
    #+---+---+
    

    In this dataframe, the combination of columns a and b has 2 values that occur exactly once, (foo, 1) and (baz, 1), out of 4 rows, so Uniqueness = 2/4 = 0.5 here. Let's verify it using the check constraint:

    from pydeequ.checks import CheckLevel, Check
    from pydeequ.verification import VerificationResult, VerificationSuite
    
    result = VerificationSuite(spark).onData(df).addCheck(
        Check(spark, CheckLevel.Warning, "test hasUniqueness")
            .hasUniqueness(["a", "b"], lambda x: x == 0.5)
    ).run()
    
    result_df = VerificationResult.checkResultsAsDataFrame(spark, result)
    
    result_df.select("constraint_status").show()
    
    #+-----------------+
    #|constraint_status|
    #+-----------------+
    #|          Success|
    #+-----------------+
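
    As a side check, the uniqueness fraction described above can be recomputed outside Spark with a small helper (a hypothetical plain-Python sketch, not part of PyDeequ): count how often each key tuple appears and divide the number of tuples seen exactly once by the total row count.

    ```python
    from collections import Counter

    def uniqueness(rows):
        """Fraction of rows whose key tuple occurs exactly once (Deequ-style Uniqueness)."""
        counts = Counter(rows)
        return sum(1 for c in counts.values() if c == 1) / len(rows)

    # same four (a, b) rows as the example dataframe
    rows = [("foo", 1), ("bar", 0), ("baz", 1), ("bar", 0)]
    print(uniqueness(rows))  # 0.5 -> the lambda x: x == 0.5 assertion passes
    ```

    This also clarifies what assertion to write: the lambda receives that computed fraction, so `lambda x: x == 1` would demand every (a, b) combination be unique, while a threshold like `lambda x: x >= 0.5` tolerates some duplication.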