Tags: scala, azure-databricks, amazon-deequ

Azure Databricks - Deequ - Finding rows that failed on a check


I followed https://aws.amazon.com/blogs/big-data/test-data-quality-at-scale-with-deequ/ and got the checks and verification running.

But I am not able to find out exactly which rows my data is failing on. That is a very important part: I need the rows that failed the check.

I tried following https://github.com/awslabs/deequ/blob/master/src/test/scala/com/amazon/deequ/schema/RowLevelSchemaValidatorTest.scala, but I am getting errors in Databricks while running the code from that link:

error: object SparkContextSpec is not a member of package com.amazon.deequ
import com.amazon.deequ.SparkContextSpec
       ^
command-4342528364312961:24: error: not found: type SparkContextSpec
class RowLevelSchemaValidatorTest extends WordSpec with SparkContextSpec {
                                                        ^
command-4342528364312961:28: error: not found: value withSparkSession
    "correctly enforce null constraints" in withSparkSession { sparkSession =>
                                            ^
command-4342528364312961:39: error: not found: value RowLevelSchema
      val schema = RowLevelSchema()
                   ^
command-4342528364312961:40: error: not found: value isNullable
        .withIntColumn("id", isNullable = false)

And the list goes on.

Please help.

Thanks


Solution

  • The problems you are encountering are likely due to an incorrect project setup. Are you running the tests from your IDE? If not, I would recommend making sure that the code compiles in an IDE such as IntelliJ; the unit tests should then be executable from there.

    IntelliJ comes with a Maven plugin that allows importing projects.
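
    Also note that SparkContextSpec and withSparkSession come from deequ's test sources (they live under src/test in the repository), so they are not available from the published library, which is why those imports fail in a notebook. If you only want the row-level validation itself, one option is to drop the test scaffolding and call the validator with the notebook's existing spark session. Below is a minimal sketch of that adaptation, assuming the deequ jar is attached to your Databricks cluster; the column names and sample data are made up for illustration:

    import com.amazon.deequ.schema.{RowLevelSchema, RowLevelSchemaValidator}

    // Hypothetical example data; replace with the DataFrame you want to check.
    // `spark` is the session that Databricks provides in every notebook.
    val data = spark.createDataFrame(Seq(
      ("123", "Thingy A"),
      (null,  "Thingy B"),
      ("456", null)
    )).toDF("id", "name")

    // Describe what every single row must satisfy.
    val schema = RowLevelSchema()
      .withIntColumn("id", isNullable = false)
      .withStringColumn("name", isNullable = false, maxLength = Some(10))

    // Split the data into rows that conform to the schema and rows that do not.
    val result = RowLevelSchemaValidator.validate(data, schema)

    println(s"valid rows: ${result.numValidRows}, invalid rows: ${result.numInvalidRows}")
    result.invalidRows.show()   // these are the rows that failed

    This covers the row-level schema validation from the test you linked; as far as I can tell, the Check/VerificationSuite API from the blog post works on aggregated metrics and does not, by itself, point at individual failing rows.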