Tags: apache-spark, user-defined-functions, apache-spark-sql, udf

Spark spelling correction via UDF


I need to correct some spellings using Spark. Unfortunately, a naive approach like

val misspellings3 = misspellings1
    .withColumn("A", when('A === "error1", "replacement1").otherwise('A))
    .withColumn("A", when('A === "error2", "replacement2").otherwise('A))
    .withColumn("B", when(('B === "conditionC") && ('D === "condition3"), "replacementC").otherwise('B))

does not work with Spark: long chains of withColumn fail (see the question "How to add new columns based on conditions (without facing JaninoRuntimeException or OutOfMemoryError)?").

The simple cases (the first two examples) can be handled nicely via

import org.apache.spark.sql.functions.udf

val spellingMistakes = Map(
    "error1" -> "fix1"
  )

  val spellingNameCorrection: (String => String) = (t: String) => {
    spellingMistakes.get(t) match {
      case Some(tt) => tt // correct spelling found in the map
      case None => t      // keep original
    }
  }
  val spellingUDF = udf(spellingNameCorrection)

  val misspellings1 = hiddenSeasonalities
    .withColumn("A", spellingUDF('A))

But I am unsure how to handle the more complex, chained conditional replacements in a UDF in a nice and generalizable manner. If it is only a rather small list of spellings (fewer than 50), would you suggest hard-coding them within a UDF?
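One way to fold the conditional case into the same UDF style, without a chain of when/otherwise expressions, is a two-argument UDF over both columns. This is only a sketch: misspellings1 and the column names and literals are carried over from the snippets above as assumptions.

```scala
import org.apache.spark.sql.functions.udf

// Pure correction function: replace B only when the companion column D
// has the required value. The literals mirror the example above.
val correctB: (String, String) => String = (b, d) =>
  if (b == "conditionC" && d == "condition3") "replacementC" else b

val correctBUDF = udf(correctB)

// Applied to the DataFrame from the earlier snippet (assumed name):
val misspellings3 = misspellings1.withColumn("B", correctBUDF('B, 'D))
```

Because correctB is a plain Scala function, the replacement logic can be unit-tested without a SparkSession.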


Solution

  • For now I will go with the following, which seems to work just fine and is more understandable: https://gist.github.com/rchukh/84ac39310b384abedb89c299b24b9306
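The gist itself is not reproduced here, but the general pattern it points at — driving a single UDF from a replacement Map (optionally broadcast to the executors) instead of chaining withColumn calls — can be sketched as follows. The map entries and names are illustrative, not taken from the gist:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

// Pure lookup: return the corrected spelling, or the input unchanged.
val correct: Map[String, String] => String => String =
  m => s => m.getOrElse(s, s)

val spark = SparkSession.builder.getOrCreate()

// Illustrative replacement table; broadcasting sends it to each
// executor once instead of shipping it with every task closure.
val corrections = Map("error1" -> "fix1", "error2" -> "fix2")
val correctionsBc = spark.sparkContext.broadcast(corrections)

val correctUDF = udf((s: String) => correct(correctionsBc.value)(s))

// One UDF call per column, no chained when/otherwise expressions:
// misspellings1.withColumn("A", correctUDF('A))
```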