Scala - unit testing a Column type function


I have a function isJSON() that returns a Boolean expression of type Column.

  def isJSON( element: Column ): Column = {
    element.contains("{") && element.contains("}")
  }

This is how I usually use it, and it works as expected:

df.withColumn("is_json", isJSON( col("data") ))

I'm trying to write a unit test using FunSpec, but I can't work out how to assert on a value of type Column.

describe("isJSON()") {
  it("should return false if data is not JSON") {
    val df = Seq( "Not a JSON" ).toDF( "data" )
    assert( isJSON( df("data") ).equals( lit( false ) ))
  }
}

The unit test fails with the following stack trace:

ScalaTestFailureLocation: com.mhedu.common.datalake.DatalakeFunSpecTest$$anonfun$1$$anonfun$apply$mcV$sp$1 at (DatalakeFunSpecTest.scala:29)
org.scalatest.exceptions.TestFailedException: datalake.this.`package`.isJSON(df.apply("data")).equals(org.apache.spark.sql.functions.lit(false)) was false
    at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
    at org.scalatest.FunSpec.newAssertionFailedException(FunSpec.scala:1626)
    at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
    at com.mhedu.common.datalake.DatalakeFunSpecTest$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(DatalakeFunSpecTest.scala:29)
    at com.mhedu.common.datalake.DatalakeFunSpecTest$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(DatalakeFunSpecTest.scala:23)
    at com.mhedu.common.datalake.DatalakeFunSpecTest$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(DatalakeFunSpecTest.scala:23)
    at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
    at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
    at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
    at org.scalatest.Transformer.apply(Transformer.scala:22)
    at org.scalatest.Transformer.apply(Transformer.scala:20)
    at org.scalatest.FunSpecLike$$anon$1.apply(FunSpecLike.scala:422)
    at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
    at org.scalatest.FunSpec.withFixture(FunSpec.scala:1626)
    at org.scalatest.FunSpecLike$class.invokeWithFixture$1(FunSpecLike.scala:419)
    at org.scalatest.FunSpecLike$$anonfun$runTest$1.apply(FunSpecLike.scala:431)
    at org.scalatest.FunSpecLike$$anonfun$runTest$1.apply(FunSpecLike.scala:431)
    at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
    at org.scalatest.FunSpecLike$class.runTest(FunSpecLike.scala:431)
    at com.mhedu.common.datalake.DatalakeFunSpecTest.org$scalatest$BeforeAndAfter$$super$runTest(DatalakeFunSpecTest.scala:13)
    at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200)
    at com.mhedu.common.datalake.DatalakeFunSpecTest.runTest(DatalakeFunSpecTest.scala:13)
    at org.scalatest.FunSpecLike$$anonfun$runTests$1.apply(FunSpecLike.scala:464)
    at org.scalatest.FunSpecLike$$anonfun$runTests$1.apply(FunSpecLike.scala:464)
    at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
    at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
    at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:390)
    at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:427)
    at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
    at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
    at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
    at org.scalatest.FunSpecLike$class.runTests(FunSpecLike.scala:464)
    at org.scalatest.FunSpec.runTests(FunSpec.scala:1626)
    at org.scalatest.Suite$class.run(Suite.scala:1424)
    at org.scalatest.FunSpec.org$scalatest$FunSpecLike$$super$run(FunSpec.scala:1626)
    at org.scalatest.FunSpecLike$$anonfun$run$1.apply(FunSpecLike.scala:468)
    at org.scalatest.FunSpecLike$$anonfun$run$1.apply(FunSpecLike.scala:468)
    at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
    at org.scalatest.FunSpecLike$class.run(FunSpecLike.scala:468)
    at com.mhedu.common.datalake.DatalakeFunSpecTest.org$scalatest$BeforeAndAfter$$super$run(DatalakeFunSpecTest.scala:13)
    at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241)
    at com.mhedu.common.datalake.DatalakeFunSpecTest.run(DatalakeFunSpecTest.scala:13)
    at org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:55)
    at org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun$3.apply(Runner.scala:2563)
    at org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun$3.apply(Runner.scala:2557)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:2557)
    at org.scalatest.tools.Runner$$anonfun$runOptionallyWithPassFailReporter$2.apply(Runner.scala:1044)
    at org.scalatest.tools.Runner$$anonfun$runOptionallyWithPassFailReporter$2.apply(Runner.scala:1043)
    at org.scalatest.tools.Runner$.withClassLoaderAndDispatchReporter(Runner.scala:2722)
    at org.scalatest.tools.Runner$.runOptionallyWithPassFailReporter(Runner.scala:1043)
    at org.scalatest.tools.Runner$.run(Runner.scala:883)
    at org.scalatest.tools.Runner.run(Runner.scala)
    at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.runScalaTest2(ScalaTestRunner.java:138)
    at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.main(ScalaTestRunner.java:28)

Is there a way to write assertions on a Column, or to extract the column's raw Boolean value and compare that instead?


Solution

  • You're testing two Column instances for equality. These instances aren't equal: a Column is only an unevaluated expression, so although both would produce the same result if applied to your DataFrame, it's easy to apply them to a different DataFrame and get different results.

    One way to test this is to filter the DataFrame on the condition that the result of isJSON equals lit(true), and then assert that the filtered result is empty:

    describe("isJSON()") {
      it("should return false if data is not JSON") {
        val df = Seq("Not a JSON").toDF( "data" )
        assert(df.filter(isJSON(df("data")) === lit(true)).count() == 0)
      }
    }
    

    Another option is to collect the results of evaluating this column and assert that all of them are false, e.g.:

    import org.apache.spark.sql.Row  // needed for the Row(...) pattern below

    describe("isJSON()") {
      it("should return false if data is not JSON") {
        val df = Seq("Not a JSON").toDF( "data" )
        val results: Array[Boolean] = df.select(isJSON(df("data"))).collect().map { case Row(b: Boolean) => b }
        assert(results sameElements Array(false))
      }
    }
    

    There are many other similar options. The important concept is to compare data rather than Column objects: as long as the values compared in the assert expression are Columns, you're not comparing actual results.
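
As a minimal sketch of the "compare data, not Columns" idea, the suite below decodes the evaluated column into plain Scala values and asserts on those. It assumes ScalaTest 3.0.x (where FunSpec lives in org.scalatest; newer versions use org.scalatest.funspec.AnyFunSpec) and a local SparkSession created just for the test. The class name IsJsonSpec, the session settings, and the second sample input are illustrative and not taken from the question.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.col
import org.scalatest.FunSpec

class IsJsonSpec extends FunSpec {
  // Local session for tests only; master and app name are illustrative choices.
  private lazy val spark: SparkSession =
    SparkSession.builder().master("local[1]").appName("isJSON-test").getOrCreate()

  import spark.implicits._

  // Same function as in the question, repeated so the sketch is self-contained.
  def isJSON(element: Column): Column =
    element.contains("{") && element.contains("}")

  describe("isJSON()") {
    it("should flag only values containing both braces") {
      val df = Seq("Not a JSON", """{"a": 1}""").toDF("data")

      // Evaluate the Column against real rows, then decode to Scala tuples.
      val actual: Set[(String, Boolean)] =
        df.select(col("data"), isJSON(col("data")))
          .as[(String, Boolean)]
          .collect()
          .toSet

      // Comparing sets of (input, result) pairs avoids relying on row order.
      assert(actual == Set(("Not a JSON", false), ("""{"a": 1}""", true)))
    }
  }
}
```

Pairing each input with its result keeps a failure message readable and lets one test cover several cases. Note that decoding into a Scala Boolean assumes the column never evaluates to null; for nullable input, decoding as Option[Boolean] would be the safer choice.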