TypeError in Gradient Boosted Trees from mllib


I am trying to run a gradient boosted tree algorithm on some data with mixed types:

[('feature1', 'bigint'),
 ('feature2', 'int'),
 ('label', 'double')]

I tried the following:

from pyspark.mllib.tree import GradientBoostedTrees, GradientBoostedTreesModel
from pyspark.ml.feature import VectorAssembler
from pyspark.mllib.linalg import Vector as MLLibVector, Vectors as MLLibVectors
from pyspark.mllib.regression import LabeledPoint
from pyspark.sql import functions as F

vectorAssembler = VectorAssembler(inputCols = ["feature1", "feature2"], outputCol = "features")

data_assembled = vectorAssembler.transform(data)
data_assembled = data_assembled.select(['features', 'label'])
data_assembled = data_assembled.select(F.col("features"), F.col("label"))\
  .rdd\
  .map(lambda row: LabeledPoint(MLLibVectors.fromML(row.label), MLLibVectors.fromML(row.features)))

(trainingData, testData) = data_assembled.randomSplit([0.9, 0.1])

model = GradientBoostedTrees.trainRegressor(trainingData,
                                            categoricalFeaturesInfo={}, numIterations=100)

However I get the following error:

TypeError: Unsupported vector type <class 'float'>

But none of my column types is actually float. Also, feature2 is binary, if that is relevant.


Solution

  • I ended up avoiding the mllib implementation and going with Spark ML instead:

    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import GBTRegressor
    from pyspark.sql import functions as F
    
    vectorAssembler = VectorAssembler(inputCols = ["feature1", "feature2"], outputCol = "features")
    
    data_assembled = vectorAssembler.transform(data)
    data_assembled = data_assembled.select(F.col("label"), F.col("features"))
    
    (trainingData, testData) = data_assembled.randomSplit([0.7, 0.3])
    
    gbt_model = GBTRegressor(featuresCol="features", maxIter=10).fit(trainingData)
    

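    As for the error itself: a Spark double column arrives in Python as a float (Python's float is already double precision), so row.label is a plain float. MLLibVectors.fromML accepts only ML vectors, so passing it row.label is what raises the TypeError. In the mllib version, the label should go to LabeledPoint directly as the first argument, followed by the converted features.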