I have the following code, which is used to SHA-hash columns in a Spark DataFrame:
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, lit, sha2}

    object hashing {
      def process(hashFieldNames: List[String])(df: DataFrame): DataFrame = {
        hashFieldNames.foldLeft(df) { case (acc, hashField) =>
          acc.withColumn(hashField, sha2(col(hashField), 256))
        }
      }
    }
Now, in a separate file, I am testing hashing.process using an AnyWordSpec test as follows:
    "The hashing .process " should {
      // some cases here that complete successfully
      "fail to hash a spark dataframe due to type mismatch " in {
        val goodColumns = Seq("language", "usersCount", "ID", "personalData")
        val badDataSample =
          Seq(
            ("Java", "20000", 2, "happy"),
            ("Python", "100000", 3, "happy"),
            ("Scala", "3000", 1, "jolly")
          )
        val badDf =
          spark.sparkContext.parallelize(badDataSample).toDF(goodColumns: _*)
        val thrown = intercept[org.apache.spark.sql.AnalysisException] {
          val hashedResultDf =
            hashing.process(hashFieldNames)(badDf)
        }
        assert(thrown.getMessage === // some lengthy error message that I do not want to copy-paste in its entirety.
Usually, as I understand it, one would hard-code the whole error message to ensure that it is exactly what we expect. However, the message is very lengthy, and I am wondering whether there is a better approach.
Basically, I have two questions:
a.) Is it considered good practice to match only the beginning of the error message and then follow up with a regex? I am thinking of something like this: thrown.getMessage === "[cannot resolve sha2(ID, 256) due to data type mismatch: argument 1 requires binary type, however, ID is of int type.;" + regexpattern \;(.*))
b.) If a.) is considered a hacky approach, do you have any working suggestion on how to do it properly?
Note: Small errors are possible in the code above, since I adapted it for this SO post, but you should get the idea.
You should not be asserting exception messages (unless they are surfaced to the user, or something downstream relies on them). If throwing an exception is part of the contract, then you should be throwing one of a specific type with a given error code, and tests should be asserting that. And if it isn't, then who cares what the message said?
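To make that concrete, here is a minimal sketch of what "a specific type with a given error code" could look like. Everything in it is an assumption made for illustration: `HashingException`, `SafeHashing.hashOrFail` and the `HASH_TYPE_MISMATCH` code are hypothetical names, not part of Spark or of the code in the question. In the real code, the wrapped block would presumably be the `withColumn(..., sha2(...))` call and the handler would catch `AnalysisException` specifically rather than any non-fatal error.

```scala
import scala.util.{Failure, Try}
import scala.util.control.NonFatal

// Hypothetical contract exception: callers and tests rely on the type and
// the error code, never on the wording of the message.
final class HashingException(val errorCode: String, val column: String, underlying: Throwable)
  extends RuntimeException(s"[$errorCode] cannot hash column '$column'", underlying)

object SafeHashing {
  // Hypothetical error code, assumed for this sketch.
  val TypeMismatch = "HASH_TYPE_MISMATCH"

  // Evaluates `hashColumn` and translates any non-fatal failure into the
  // contract exception, keeping the original error as the cause.
  def hashOrFail[A](column: String)(hashColumn: => A): A =
    try hashColumn
    catch {
      case NonFatal(e) => throw new HashingException(TypeMismatch, column, e)
    }
}

object Example {
  def main(args: Array[String]): Unit = {
    // Stand-in for the failing Spark call; its long message is irrelevant here.
    val result = Try(SafeHashing.hashOrFail[Unit]("ID") {
      throw new IllegalArgumentException("some lengthy, unstable message")
    })

    result match {
      case Failure(e: HashingException) =>
        assert(e.errorCode == SafeHashing.TypeMismatch)
        assert(e.column == "ID")
      case other =>
        throw new AssertionError(s"expected a HashingException, got: $other")
    }
  }
}
```

In a ScalaTest spec the same check fits the style already used in the question: `intercept[HashingException] { ... }` returns the caught exception, so the test can assert on `thrown.errorCode` and leave the message text alone.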