pyspark apache-spark-sql feature-extraction apache-spark-ml

VectorAssembler gives unexpected values in the features column


I've used VectorAssembler many times and it has always worked well. But today I got unwanted data in the features column, as shown in the figure below.

The input is 4 features without NaN values, taken from a PySpark DataFrame.

from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=descritif.columns, outputCol='features')
pcaFeatures = assembler.transform(descritif).select('features')
pcaFeatures.show(truncate=False)

Why do I get (5,[0,1] at the start of every row in the features column? Is this normal? Does it affect learning?


Solution

  • After two days working on this problem I found a solution. There are two posts on Stack Overflow about it, but neither gave an effective solution.

    1 - First, apply this UDF to convert the data.

    from pyspark.sql import functions as F
    from pyspark.sql import types as T
    
    def sparse_to_array(v):
      # toArray() expands sparse and dense vectors alike into a full NumPy array
      return [float(x) for x in v.toArray()]
    
    # FloatType stores 32-bit floats; T.DoubleType() would keep full precision
    sparse_to_array_udf = F.udf(sparse_to_array, T.ArrayType(T.FloatType()))
    

    2 - Then apply it to the data.

    # convert the vector column to an array column
    df = pcaFeatures.withColumn('features_array', sparse_to_array_udf('features'))
    

    If you then want to convert this array back to a Vector, please visit this website. The conversion is needed because after this step the column holds a plain array rather than a vector, and PCA or any other estimator will raise the error below when fitting or transforming the data.

    IllegalArgumentException: 'requirement failed: Column pcaFeatures_Norm must be of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually array<float>.'