scala, apache-spark, one-hot-encoding

How to change an array of integers to individual columns in Spark (Scala)?


I followed this solution for one-hot encoding. Now I want to expand the last element of each tuple (an array of integers) so that every one-hot-encoded value gets its own column.

My current RDD is:

scala> encode_cars
res2: org.apache.spark.rdd.RDD[(Double, Double, Double, Double, Array[Int])] = MapPartitionsRDD[17] at map at <console>:27

Ideally, I would like something like:

res2: org.apache.spark.rdd.RDD[(Double, Double, Double, Double, Int, Int, Int, Int, Int, Int, Int)] = MapPartitionsRDD[17] at map at <console>:27

I know this could be done with map / flatMap, but I am not sure how to do it.


Solution

  • I found an easy solution: index into the array inside a map call:

    encode_cars.map(x => (x._1, x._2, x._3, x._4, x._5(1), x._5(2), x._5(3)))
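
Note that Scala arrays are zero-indexed, so x._5(0) is the first element, and the snippet above keeps only three of the values. As a rough sketch of how all seven Int columns from the question could be produced, a pattern match on the array may be easier to read than repeated indexing. The object name ExpandOneHot, the helper flatten, the type aliases, and the sample rows below are made up for illustration, and a plain Seq stands in for the RDD so the example runs without Spark:

```scala
// Sketch only: a plain Seq stands in for the RDD so this runs without Spark.
// ExpandOneHot, flatten, the type aliases and the sample rows are hypothetical.
object ExpandOneHot {
  type Encoded = (Double, Double, Double, Double, Array[Int])
  type Flat = (Double, Double, Double, Double, Int, Int, Int, Int, Int, Int, Int)

  // Expand a 7-element one-hot array into seven separate tuple fields.
  // An array of any other length fails the match and throws a MatchError.
  def flatten(row: Encoded): Flat = row match {
    case (a, b, c, d, Array(e0, e1, e2, e3, e4, e5, e6)) =>
      (a, b, c, d, e0, e1, e2, e3, e4, e5, e6)
  }

  def main(args: Array[String]): Unit = {
    // Made-up rows shaped like the question's RDD elements.
    val rows: Seq[Encoded] = Seq(
      (21.0, 6.0, 160.0, 110.0, Array(0, 0, 1, 0, 0, 0, 0)),
      (22.8, 4.0, 108.0, 93.0, Array(1, 0, 0, 0, 0, 0, 0))
    )
    rows.map(flatten).foreach(println)
  }
}
```

Assuming the one-hot array always has seven entries, the same function should apply to the RDD as encode_cars.map(ExpandOneHot.flatten). Keep in mind that Scala tuples are limited to 22 fields, so this approach only suits a small number of categories.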