pyspark apache-spark-sql feature-extraction apache-spark-ml

VectorAssembler output puts unwanted values into the features column

I've used VectorAssembler many times and it has always worked well. But today I got unexpected data in the features column, as shown in the figure below.

The input is 4 features, without NaN values, taken from a PySpark DataFrame.

from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=descritif.columns, outputCol='features')
pcaFeatures = assembler.transform(descritif).select('features')
pcaFeatures.show(truncate=False)

Why do I get (5,[0,1] before every row in the features column? Is this normal? Does it affect learning?

Solution

After two days on this problem I found a solution. There are two posts on Stack Overflow about it, but neither gave an effective solution.

1 - First, define this UDF to convert the data.

from pyspark.sql import functions as F
from pyspark.sql import types as T

def sparse_to_array(v):
    # toArray() works for both SparseVector and DenseVector inputs
    return [float(x) for x in v.toArray()]

sparse_to_array_udf = F.udf(sparse_to_array, T.ArrayType(T.FloatType()))

2 - Then apply it to the data.

# convert the vector column to an array column
df = pcaFeatures.withColumn('features_array', sparse_to_array_udf('features'))
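
For context, the (5,[0,1],... prefix is how Spark prints a sparse vector: the vector size, then the indices of the non-zero entries, then their values. It holds the same data as the dense form, only stored more compactly, so as far as I know it should not change what a model learns. Here is a minimal plain-Python sketch of what the conversion does to one row (no Spark needed; `sparse_to_dense` and the sample values are made up for illustration):

```python
def sparse_to_dense(size, indices, values):
    """Expand a sparse (size, indices, values) triple into a dense list."""
    dense = [0.0] * size              # every slot defaults to zero
    for i, x in zip(indices, values):
        dense[i] = float(x)           # fill in only the stored entries
    return dense

# Hypothetical row: 5 slots, non-zero entries at positions 0 and 1.
row = sparse_to_dense(5, [0, 1], [3.0, 7.5])
print(row)  # [3.0, 7.5, 0.0, 0.0, 0.0]
```

The UDF above does the same expansion on every row, using Spark's own vector classes.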

If you then need to convert this array back to a Vector, please visit this website. The conversion is needed because after this step the column holds an array, not a Vector, so you will get the error below from PCA or other estimators when fitting or transforming the data.

IllegalArgumentException: 'requirement failed: Column pcaFeatures_Norm must be of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually array<float>.'