Tags: scala, apache-spark, apache-spark-sql, rdd, user-defined-functions

Access a Scala Map from a DataFrame without using UDFs


I have a Spark (version 1.6) DataFrame, and I would like to add a column whose value is looked up from a Scala Map. This is my simplified code:

val map = Map("VAL1" -> 1, "VAL2" -> 2)
val df2 = df.withColumn("newVal", map(col("key")))

This code doesn't work, and I receive the following error, because the map expects a String key but receives a Column:

found   : org.apache.spark.sql.Column
required: String

The only way I could make this work is with a UDF:

val map = Map("VAL1" -> 1, "VAL2" -> 2)
val myUdf = udf { value: String => map(value) }
val df2 = df.withColumn("newVal", myUdf($"key"))

I want to avoid the use of UDFs if possible.

Are there any other solutions available using just the DataFrame API? (I would also like to avoid converting it to an RDD.)


Solution

  • You could convert the Map to a DataFrame and JOIN it with your existing DataFrame. Since the Map DataFrame would be very small, Spark can execute this as a broadcast join and avoid the need for a shuffle phase, as sketched after this answer.

    Letting Spark know to use a broadcast join is described in this answer: DataFrame join optimization - Broadcast Hash Join
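
    A minimal sketch of this approach, assuming a Spark 1.6 SQLContext named sqlContext and that df has a string column named "key":

    import org.apache.spark.sql.functions.broadcast
    import sqlContext.implicits._

    val map = Map("VAL1" -> 1, "VAL2" -> 2)

    // Turn the Map into a small two-column lookup DataFrame.
    val mapDF = map.toSeq.toDF("key", "newVal")

    // broadcast() hints Spark to ship the small lookup table to every
    // executor, so the join becomes a broadcast hash join with no shuffle.
    // A left outer join keeps rows whose key is missing from the Map
    // (their newVal will be null).
    val df2 = df.join(broadcast(mapDF), Seq("key"), "left_outer")

    An inner join would instead drop rows whose key has no entry in the Map; note that the UDF version would throw a NoSuchElementException on a missing key, so pick the join type that matches the semantics you need.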