Tags: scala, apache-spark, apache-spark-sql, apache-spark-ml

How to convert a Spark DataFrame column from Array[Int] to linalg.Vector?


I have a dataframe, df, that looks like this:

+--------+--------------------+
| user_id|        is_following|
+--------+--------------------+
|       1|[2, 3, 4, 5, 6, 7]  |
|       2|[20, 30, 40, 50]    |
+--------+--------------------+

I can confirm this has the schema:

root
 |-- user_id: integer (nullable = true)
 |-- is_following: array (nullable = true)
 |    |-- element: integer (containsNull = true)

I would like to use Spark's ML routines, such as LDA, to do some machine learning on this, which requires converting the is_following column to a linalg.Vector (not a Scala Vector). When I try to do this via

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors

val assembler = new VectorAssembler().setInputCols(Array("is_following")).setOutputCol("features")
val output = assembler.transform(df)

I then get the following error:

java.lang.IllegalArgumentException: Data type ArrayType(IntegerType,true) is not supported.

If I am interpreting that correctly, the takeaway is that I need to convert the types here from integer to something else (Double? String?).

My question is: what is the best way to convert this array into something that will vectorize properly for the ML pipeline?

EDIT: If it helps, I don't have to structure the dataframe this way. I could instead have it be:

+--------+------------+
| user_id|is_following|
+--------+------------+
|       1|           2|
|       1|           3|
|       1|           4|
|       1|           5|
|       1|           6|
|       1|           7|
|       2|          20|
|     ...|         ...|
+--------+------------+

Solution

  • So your initial input might be better suited than your transformed input. Spark's VectorAssembler requires that all of its input columns be Doubles, not arrays of Doubles. Since different users can follow different numbers of people, your ungrouped structure could be a good fit; you just need to convert is_following into a Double. You could in fact do this with Spark's VectorIndexer (https://spark.apache.org/docs/2.1.0/ml-features.html#vectorindexer), or just do it manually in SQL, as sketched below.
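
    For example, a minimal sketch of the manual cast, assuming the ungrouped layout from your edit and the column names from your example:

    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.sql.functions.col

    // Cast the integer column to Double so VectorAssembler will accept it.
    val dfDouble = df.withColumn("is_following", col("is_following").cast("double"))

    val assembler = new VectorAssembler()
      .setInputCols(Array("is_following"))
      .setOutputCol("features")

    val output = assembler.transform(dfDouble)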

    So the tl;dr is: the type error occurs because Spark's Vectors only support Doubles (this is likely to change, probably for image data, in the not-so-distant future, but that isn't a good fit for your use case anyway), and your alternative structure (the one without the grouping) might actually be better suited.

    You might find the collaborative filtering example in the Spark documentation useful for your further adventures: https://spark.apache.org/docs/latest/ml-collaborative-filtering.html. Good luck and have fun with Spark ML :)

    Edit:

    I noticed you said you're looking to do LDA on the inputs, so let's also look at how to prepare the data for that format. For LDA input you might want to consider using CountVectorizer (see https://spark.apache.org/docs/2.1.0/ml-features.html#countvectorizer); a sketch follows below.
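
    A minimal sketch, assuming your original array-based frame: CountVectorizer expects an array of strings per row, so the integer IDs are cast to strings first and treated as "words" (the k for LDA below is purely illustrative):

    import org.apache.spark.ml.feature.CountVectorizer
    import org.apache.spark.sql.functions.col

    // CountVectorizer wants array<string>, so cast the Array[Int] column element-wise.
    val docs = df.withColumn("is_following", col("is_following").cast("array<string>"))

    val cv = new CountVectorizer()
      .setInputCol("is_following")
      .setOutputCol("features")

    val cvModel = cv.fit(docs)
    val vectorized = cvModel.transform(docs)

    // The resulting "features" column is a linalg.Vector, ready to feed into LDA.
    import org.apache.spark.ml.clustering.LDA

    val lda = new LDA().setK(10).setFeaturesCol("features") // k = 10 is illustrative
    val ldaModel = lda.fit(vectorized)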