
How to use hasUniqueness check in PyDeequ?


I'm using PyDeequ for data quality checks and I want to check the uniqueness of a set of columns. There is a Check method hasUniqueness, but I can't figure out how to use it.

I'm trying:

check.hasUniqueness([col1, col2], ????) 

But what should we use for the assertion function in place of ????

Has anyone tried the check hasUniqueness for a combination of columns?


Solution

  • hasUniqueness takes a function that accepts an int/float parameter and returns a boolean:

    Creates a constraint that asserts any uniqueness in a single or combined set of key columns. Uniqueness is the fraction of a column's (or column combination's) values that occur exactly once.

    Here's an example of usage:

    # sample dataframe matching the output below
    df = spark.createDataFrame(
        [("foo", 1), ("bar", 0), ("baz", 1), ("bar", 0)],
        ["a", "b"],
    )
    df.show()
    #+---+---+
    #|  a|  b|
    #+---+---+
    #|foo|  1|
    #|bar|  0|
    #|baz|  1|
    #|bar|  0|
    #+---+---+
    

    In this dataframe, the combination of columns a and b has 2 values that occur exactly once, (foo, 1) and (baz, 1), out of 4 rows, so Uniqueness = 2/4 = 0.5 here. Let's verify it using the check constraint:

    from pydeequ.checks import CheckLevel, Check
    from pydeequ.verification import VerificationResult, VerificationSuite
    
    result = VerificationSuite(spark).onData(df).addCheck(
        Check(spark, CheckLevel.Warning, "test hasUniqueness")
            .hasUniqueness(["a", "b"], lambda x: x == 0.5)
    ).run()
    
    result_df = VerificationResult.checkResultsAsDataFrame(spark, result)
    
    result_df.select("constraint_status").show()
    
    #+-----------------+
    #|constraint_status|
    #+-----------------+
    #|          Success|
    #+-----------------+
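
    As a side check, the uniqueness fraction described above can be recomputed outside Spark with a small helper (a hypothetical plain-Python sketch, not part of PyDeequ): count how often each key tuple appears and divide the number of tuples seen exactly once by the total row count.

    ```python
    from collections import Counter

    def uniqueness(rows):
        """Fraction of rows whose key tuple occurs exactly once (Deequ-style Uniqueness)."""
        counts = Counter(rows)
        return sum(1 for c in counts.values() if c == 1) / len(rows)

    # same four (a, b) rows as the example dataframe
    rows = [("foo", 1), ("bar", 0), ("baz", 1), ("bar", 0)]
    print(uniqueness(rows))  # 0.5 -> the lambda x: x == 0.5 assertion passes
    ```

    This also clarifies what assertion to write: the lambda receives that computed fraction, so `lambda x: x == 1` would demand every (a, b) combination be unique, while a threshold like `lambda x: x >= 0.5` tolerates some duplication.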