Tags: scala, apache-spark, apache-spark-sql, rdd, user-defined-functions

Access a Scala Map from a DataFrame without using UDFs


I have a Spark (version 1.6) DataFrame, and I would like to add a column whose value is looked up from a Scala Map. This is my simplified code:

val map = Map("VAL1" -> 1, "VAL2" -> 2)
val df2 = df.withColumn("newVal", map(col("key")))

This code doesn't work, and I receive the following error, because the map expects a String key but receives a Column:

found   : org.apache.spark.sql.Column
required: String

The only way I could make this work is with a UDF:

val map = Map("VAL1" -> 1, "VAL2" -> 2)
val myUdf = udf { value: String => map(value) }
val df2 = df.withColumn("newVal", myUdf($"key"))

I want to avoid the use of UDFs if possible.

Are there any other solutions available using just the DataFrame API? (I would also like to avoid converting it to an RDD.)


Solution

  • You could convert the Map to a DataFrame and JOIN it with your existing DataFrame. Since the Map DataFrame would be very small, Spark can execute this as a broadcast join and avoid the need for a shuffle phase, as sketched after this answer.

    Letting Spark know to use a broadcast join is described in this answer: DataFrame join optimization - Broadcast Hash Join
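
    A minimal sketch of this approach, assuming a Spark 1.6 SQLContext named sqlContext and that df has a string column named "key":

    import org.apache.spark.sql.functions.broadcast
    import sqlContext.implicits._

    val map = Map("VAL1" -> 1, "VAL2" -> 2)

    // Turn the Map into a small two-column lookup DataFrame.
    val mapDF = map.toSeq.toDF("key", "newVal")

    // broadcast() hints Spark to ship the small lookup table to every
    // executor, so the join becomes a broadcast hash join with no shuffle.
    // A left outer join keeps rows whose key is missing from the Map
    // (their newVal will be null).
    val df2 = df.join(broadcast(mapDF), Seq("key"), "left_outer")

    An inner join would instead drop rows whose key has no entry in the Map; note that the UDF version would throw a NoSuchElementException on a missing key, so pick the join type that matches the semantics you need.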