
What data type does VectorAssembler require for an input?


The core problem is this:

from pyspark.ml.feature import VectorAssembler
df = spark.createDataFrame([([1, 2, 3], 0, 3)], ["a", "b", "c"])
vecAssembler = VectorAssembler(outputCol="features", inputCols=["a", "b", "c"])
vecAssembler.transform(df).show()

which fails with the error: IllegalArgumentException: Data type array<bigint> of column a is not supported.

I know this is a bit of a toy problem, but I'm trying to integrate this into a longer pipeline with the following steps:

  • StringIndexer
  • OneHotEncoder
  • Custom UnaryTransformer to multiply all the 1's by 10
    • What datatype should be returned here?
  • Then VectorAssembler to combine the vectors into a single vector for modeling.

If I can determine the proper input datatype for the VectorAssembler I should be able to string everything together properly. I think the input type is a Vector, but I can't figure out how to build one.


Solution

  • According to the docs,

    VectorAssembler accepts the following input column types: all numeric types, boolean type, and vector type.

    So you need to convert your array column to a vector column first (method from this question).

    from pyspark.ml.linalg import Vectors, VectorUDT
    from pyspark.sql.functions import udf

    # UDF that converts an array<bigint> column to a DenseVector column
    list_to_vector_udf = udf(lambda l: Vectors.dense(l), VectorUDT())
    df_with_vectors = df.withColumn('a', list_to_vector_udf('a'))
    

    Then you can use the VectorAssembler:

    vecAssembler = VectorAssembler(outputCol="features", inputCols=["a", "b", "c"])
    vecAssembler.transform(df_with_vectors).show(truncate=False)
    +-------------+---+---+---------------------+
    |a            |b  |c  |features             |
    +-------------+---+---+---------------------+
    |[1.0,2.0,3.0]|0  |3  |[1.0,2.0,3.0,0.0,3.0]|
    +-------------+---+---+---------------------+