python, apache-spark, pyspark, apache-spark-ml

VectorAssembler creates string values instead of original integers


I have the following PySpark DataFrame df:

df.printSchema()


 |-- yearday: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- dayofweek: integer (nullable = true)
 |-- year: integer (nullable = true)

When I apply VectorAssembler, the values in the features column appear to be strings instead of the original integers.

from pyspark.ml.feature import VectorAssembler

# Combine the four integer columns into a single vector column named 'features'
vectorAssembler = VectorAssembler(inputCols=['yearday', 'month', 'dayofweek', 'year'], outputCol='features')
df = vectorAssembler.transform(df)
df.select(['features']).show()

This is what the output looks like:

[image: output of df.select(['features']).show()]

How can I get integers in features?


Solution

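  • I suspect this is a display issue rather than a type problem: the vector should hold numbers, not strings. Note that Spark ML vectors store their elements as doubles, so an integer such as 2020 is shown as 2020.0. Try the code below to check what type the vector elements have.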

    from pyspark.ml.param import TypeConverters
    
    print(TypeConverters.toList(df.select('features').take(1)[0][0]))
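If the goal is to get whole numbers back out of the features column, one possible approach is to convert the vector elements explicitly. The following is a minimal sketch, not a definitive fix. It assumes Spark 3.0 or later (where `pyspark.ml.functions.vector_to_array` is available) and the `df` from the question. The helper names `vector_to_ints` and `add_int_array_column` and the column name `features_int` are hypothetical and chosen only for illustration.

```python
def vector_to_ints(values):
    """Convert vector elements (floats) to Python ints on the driver.

    Raises ValueError instead of silently truncating a non-whole number.
    """
    result = []
    for v in values:
        f = float(v)
        if not f.is_integer():
            raise ValueError(f"{v!r} is not a whole number")
        result.append(int(f))
    return result


def add_int_array_column(df, vector_col="features", out_col="features_int"):
    """Return df with an array<int> copy of a Spark ML vector column.

    Assumption: Spark 3.0+, which provides vector_to_array.
    Note that the cast truncates any fractional part.
    """
    # Imported lazily so the pure-Python helper above works without pyspark.
    from pyspark.ml.functions import vector_to_array
    from pyspark.sql import functions as F

    as_array = vector_to_array(F.col(vector_col))  # array<double>
    return df.withColumn(out_col, as_array.cast("array<int>"))
```

For a single row on the driver, something like `vector_to_ints(df.select('features').first()[0].toArray())` should return plain Python integers. For the whole DataFrame, `add_int_array_column(df)` keeps the original vector column, which ML estimators still expect, and adds the integer array next to it.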