pyspark, amazon-deequ

Deequ satisfies function not behaving as expected


I am using pydeequ to run some checks on my data, but it is not behaving as expected. One of my columns should contain only values between 0 and 1. The data looks like this:

| col1      |
| 0.5635412 |
| 0.123     |
| 1.0       |


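from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

# satisfies() passes the fraction of rows matching the condition to the
# assertion, so x == 1 requires every row to match.
check = Check(spark, CheckLevel.Warning, "DQ Check")
result = VerificationSuite(spark)\
    .onData(df)\
    .addCheck(check
        .satisfies("col1 BETWEEN 0 AND 1", "range check", lambda x: x == 1))\
    .run()

result_df = VerificationResult.checkResultsAsDataFrame(spark, result)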

The result is a failure with the message:

Value: 0.5635412 does not meet the constraint requirement!

Can anyone advise where I have gone wrong?


Solution

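  • I realised there were a couple of null values in the data that I hadn't expected. For a null, col1 BETWEEN 0 AND 1 evaluates to NULL rather than true, so those rows do not count as satisfying the condition and the x == 1 assertion fails.

    I updated the code to allow nulls explicitly: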

    check = Check(spark, CheckLevel.Warning, "DQ Check")
    result = VerificationSuite(spark)\
        .onData(df)\
        .addCheck(check
            .satisfies("col1 BETWEEN 0 AND 1 OR col1 IS NULL", "range check", lambda x: x == 1))\
        .run()

    result_df = VerificationResult.checkResultsAsDataFrame(spark, result)
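
    The effect of the nulls can be reproduced without a Spark cluster. The following is only a rough sketch: it uses Python's built-in sqlite3 as a stand-in for Spark SQL (an assumption, chosen because both treat a comparison against NULL as NULL rather than true), and the compliance() helper and the two extra null rows are hypothetical, added for illustration. It approximates the ratio the check asserts on; it is not Deequ's actual implementation.

```python
import sqlite3

# sqlite3 is a stand-in for Spark SQL here (assumption for illustration only).
# Data: the three values from the question plus two hypothetical NULL rows.
rows = [(0.5635412,), (0.123,), (1.0,), (None,), (None,)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col1 REAL)")
conn.executemany("INSERT INTO t VALUES (?)", rows)


def compliance(predicate: str) -> float:
    """Fraction of all rows for which the predicate is true.

    A predicate that evaluates to NULL falls into the ELSE branch,
    so NULL rows count as not satisfying the condition.
    """
    query = (
        f"SELECT SUM(CASE WHEN {predicate} THEN 1 ELSE 0 END) * 1.0 / COUNT(*) "
        "FROM t"
    )
    return conn.execute(query).fetchone()[0]


# NULL BETWEEN 0 AND 1 is NULL, so the two NULL rows drag the ratio below 1.
strict = compliance("col1 BETWEEN 0 AND 1")

# Accepting NULL explicitly makes every row satisfy the condition.
null_tolerant = compliance("col1 BETWEEN 0 AND 1 OR col1 IS NULL")

print(strict, null_tolerant)
```

    With the strict condition, only three of the five rows match, so an assertion of x == 1 fails; with the null-tolerant condition, all rows match. If nulls should be flagged rather than silently accepted, a separate completeness check on the column is probably the cleaner place to do that.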