I've used VectorAssembler
many times and it worked well. But today I got unexpected data in the features column, as shown in the figure below.
The input is 4 features without NaN values, coming from a PySpark DataFrame.
assembler = VectorAssembler(inputCols = descritif.columns, outputCol = 'features')
pcaFeatures = assembler.transform(descritif).select('features')
pcaFeatures.show(truncate=False)
Why did I get (5,[0,1]
before every row in the features column? Is this normal?
Does it affect learning?
After two days I found a solution to my problem. There are two posts on Stack Overflow about this, but neither gave an effective solution.
1 - First, apply this UDF to convert the data.
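For context, the (5,[0,1],...) prefix is not corrupted data: Spark prints a SparseVector as (size, [indices], [values]), storing only the non-zero entries. A minimal pure-Python sketch of how that notation expands into a dense row (mirroring what SparseVector.toArray() does):

```python
def sparse_to_dense(size, indices, values):
    # start from an all-zero vector of the given length
    dense = [0.0] * size
    # place each stored value at its index
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

# (5, [0, 1], [1.0, 2.0]) means: length-5 vector, positions 0 and 1 are non-zero
print(sparse_to_dense(5, [0, 1], [1.0, 2.0]))  # [1.0, 2.0, 0.0, 0.0, 0.0]
```

So the prefix is just Spark's compact printing of mostly-zero rows; the underlying values are intact.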
from pyspark.sql import functions as F
from pyspark.sql import types as T
from pyspark.ml.linalg import SparseVector, DenseVector

# Convert a (possibly sparse) ML vector into a plain list of floats
def sparse_to_array(v):
    v = DenseVector(v)
    new_array = list([float(x) for x in v])
    return new_array

sparse_to_array_udf = F.udf(sparse_to_array, T.ArrayType(T.FloatType()))
2 - Then apply it to the data.
# convert
df = pcaFeatures.withColumn('features_array', sparse_to_array_udf('features'))
Then, if you want to convert this array back into a Vector, please visit this website. Convert it back to a Vector, because after this step you can end up with an array rather than a vector, and then you'll get the error below on PCA (or other estimators) while fitting/transforming the data.
IllegalArgumentException: 'requirement failed: Column pcaFeatures_Norm must be of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually array<float>.'