I have a function isJSON() that returns a comparison of type Column.
def isJSON( element: Column ): Column = {
element.contains("{") && element.contains("}")
}
This is how I use it usually and it works as expected:
df.withColumn("is_json", isJSON( col("data") ))
I'm trying to write a unit test using FunSpec, but I'm not able to assert on data of type Column.
describe("isJSON()") {
it("should return false if data is not JSON") {
val df = Seq( "Not a JSON" ).toDF( "data" )
assert( isJSON( df("data") ).equals( lit( false ) ))
}
}
The unit test fails with the following stack trace:
ScalaTestFailureLocation: com.mhedu.common.datalake.DatalakeFunSpecTest$$anonfun$1$$anonfun$apply$mcV$sp$1 at (DatalakeFunSpecTest.scala:29)
org.scalatest.exceptions.TestFailedException: datalake.this.`package`.isJSON(df.apply("data")).equals(org.apache.spark.sql.functions.lit(false)) was false
at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
at org.scalatest.FunSpec.newAssertionFailedException(FunSpec.scala:1626)
at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
at com.mhedu.common.datalake.DatalakeFunSpecTest$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(DatalakeFunSpecTest.scala:29)
at com.mhedu.common.datalake.DatalakeFunSpecTest$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(DatalakeFunSpecTest.scala:23)
at com.mhedu.common.datalake.DatalakeFunSpecTest$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(DatalakeFunSpecTest.scala:23)
at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSpecLike$$anon$1.apply(FunSpecLike.scala:422)
at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
at org.scalatest.FunSpec.withFixture(FunSpec.scala:1626)
at org.scalatest.FunSpecLike$class.invokeWithFixture$1(FunSpecLike.scala:419)
at org.scalatest.FunSpecLike$$anonfun$runTest$1.apply(FunSpecLike.scala:431)
at org.scalatest.FunSpecLike$$anonfun$runTest$1.apply(FunSpecLike.scala:431)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
at org.scalatest.FunSpecLike$class.runTest(FunSpecLike.scala:431)
at com.mhedu.common.datalake.DatalakeFunSpecTest.org$scalatest$BeforeAndAfter$$super$runTest(DatalakeFunSpecTest.scala:13)
at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200)
at com.mhedu.common.datalake.DatalakeFunSpecTest.runTest(DatalakeFunSpecTest.scala:13)
at org.scalatest.FunSpecLike$$anonfun$runTests$1.apply(FunSpecLike.scala:464)
at org.scalatest.FunSpecLike$$anonfun$runTests$1.apply(FunSpecLike.scala:464)
at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:390)
at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:427)
at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
at org.scalatest.FunSpecLike$class.runTests(FunSpecLike.scala:464)
at org.scalatest.FunSpec.runTests(FunSpec.scala:1626)
at org.scalatest.Suite$class.run(Suite.scala:1424)
at org.scalatest.FunSpec.org$scalatest$FunSpecLike$$super$run(FunSpec.scala:1626)
at org.scalatest.FunSpecLike$$anonfun$run$1.apply(FunSpecLike.scala:468)
at org.scalatest.FunSpecLike$$anonfun$run$1.apply(FunSpecLike.scala:468)
at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
at org.scalatest.FunSpecLike$class.run(FunSpecLike.scala:468)
at com.mhedu.common.datalake.DatalakeFunSpecTest.org$scalatest$BeforeAndAfter$$super$run(DatalakeFunSpecTest.scala:13)
at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241)
at com.mhedu.common.datalake.DatalakeFunSpecTest.run(DatalakeFunSpecTest.scala:13)
at org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:55)
at org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun$3.apply(Runner.scala:2563)
at org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun$3.apply(Runner.scala:2557)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:2557)
at org.scalatest.tools.Runner$$anonfun$runOptionallyWithPassFailReporter$2.apply(Runner.scala:1044)
at org.scalatest.tools.Runner$$anonfun$runOptionallyWithPassFailReporter$2.apply(Runner.scala:1043)
at org.scalatest.tools.Runner$.withClassLoaderAndDispatchReporter(Runner.scala:2722)
at org.scalatest.tools.Runner$.runOptionallyWithPassFailReporter(Runner.scala:1043)
at org.scalatest.tools.Runner$.run(Runner.scala:883)
at org.scalatest.tools.Runner.run(Runner.scala)
at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.runScalaTest2(ScalaTestRunner.java:138)
at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.main(ScalaTestRunner.java:28)
Is there any way I can write assertions for the Column type, or somehow extract the raw value of the column as a Boolean and do the comparison?
You're testing for equality of two Column instances. These instances aren't equal: they would produce the same result if applied to your DF, but they're not the same object or expression (it's easy to apply them both to a different DF and get different results).
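To make the point concrete, here is a minimal sketch, assuming Spark 2.x or 3.x on the classpath. The object name ColumnIsAnExpression is made up for illustration. It shows that calling isJSON only builds an expression: nothing is evaluated until the Column is applied to a DataFrame, so equals has no row values to compare.

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, lit}

// Hypothetical wrapper object, used only to make the sketch runnable.
object ColumnIsAnExpression {
  // Same definition as in the question.
  def isJSON(element: Column): Column =
    element.contains("{") && element.contains("}")

  def main(args: Array[String]): Unit = {
    // This only builds an expression; no rows are read here.
    val check: Column = isJSON(col("data"))

    // Prints a description of the unevaluated expression, not true or false.
    println(check)

    // Compares two Column objects rather than any row values;
    // this is the comparison that failed in the question's test.
    println(check.equals(lit(false)))
  }
}
```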
One way of testing this would be to filter the DataFrame with the condition of these two Columns (the result of isJSON and lit(true)) being equal, and then assert that the size of the result is 0:
describe("isJSON()") {
it("should return false if data is not JSON") {
val df = Seq("Not a JSON").toDF( "data" )
assert(df.filter(isJSON(df("data")) === lit(true)).count() == 0)
}
}
Another option would be to collect the results of calculating this column and assert that all results are false, e.g.:
describe("isJSON()") {
it("should return false if data is not JSON") {
val df = Seq("Not a JSON").toDF( "data" )
import org.apache.spark.sql.Row // needed for the Row(...) pattern match below
val results: Array[Boolean] = df.select(isJSON(df("data"))).collect().map { case Row(b: Boolean) => b }
assert(results sameElements Array(false))
}
}
There are many other similar options; the important concept here is comparing data instead of Column objects. As long as the compared types in the assert expression are columns, you're not comparing actual results.
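If many tests need this, the "compare data, not columns" idea could be wrapped in a small helper. The following is only a sketch under assumptions: evalOnValue and ColumnTestSupport are hypothetical names (not part of Spark or ScalaTest), and the local SparkSession is created here just to keep the example self-contained. In a real suite you would pass in whatever session your tests already share.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical test-support object; the names are illustrative only.
object ColumnTestSupport {
  // Function under test, copied from the question.
  def isJSON(element: Column): Column =
    element.contains("{") && element.contains("}")

  // Builds a one-row DataFrame with a "data" column, applies the given
  // Column => Column function to it, and returns the resulting plain Boolean.
  def evalOnValue(spark: SparkSession, value: String)(f: Column => Column): Boolean = {
    import spark.implicits._
    Seq(value).toDF("data").select(f(col("data"))).as[Boolean].head()
  }

  def main(args: Array[String]): Unit = {
    // Assumed setup: a throwaway local session for this sketch.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("column-test-sketch")
      .getOrCreate()
    try {
      // Both assertions compare Booleans, not Column objects.
      assert(!evalOnValue(spark, "Not a JSON")(isJSON))
      assert(evalOnValue(spark, """{"a": 1}""")(isJSON))
    } finally {
      spark.stop()
    }
  }
}
```

Inside a FunSpec test the same call would read assert(!evalOnValue(spark, "Not a JSON")(isJSON)), which keeps each test case to one line while still asserting on actual results.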