apache-spark, pyspark, apache-spark-ml

How to train a SparkML gradient boosting classifier given an RDD


Given the following RDD:

from pyspark.sql.functions import col

training_rdd = rdd.select(
    # Categorical features
    col('device_os'),  # 'ios', 'android'

    # Numeric features
    col('30day_click_count'),
    col('30day_impression_count'),
    (col('30day_click_count') / col('30day_impression_count')).alias('30day_click_through_rate'),

    # label
    col('did_click').alias('label')
)

I am confused about the syntax for training a gradient boosting classifier.

I am following this tutorial: https://spark.apache.org/docs/latest/ml-classification-regression.html#gradient-boosted-tree-classifier

However, I am unsure how to get my 4 feature columns into a single vector, because VectorIndexer assumes that all the features are already in one column.


Solution

  • You can use VectorAssembler to generate the feature vector. Note that you will have to convert your RDD to a DataFrame first (see the end-to-end sketch at the end of this answer).

    from pyspark.ml.feature import VectorAssembler
    vectorizer = VectorAssembler()
    
    vectorizer.setInputCols(["device_os",
                             "30day_click_count",
                             "30day_impression_count",
                             "30day_click_through_rate"])
    
    vectorizer.setOutputCol("features")
    

    Consequently, you will need to put the vectorizer as the first stage in the Pipeline:

    pipeline = Pipeline(stages=[vectorizer, ...])
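
    Two caveats worth noting: VectorAssembler only accepts numeric, boolean, and vector columns, so a string categorical such as device_os has to be indexed into a numeric column first (for example with StringIndexer), and GBTClassifier expects a numeric label column. The sketch below is a minimal end-to-end example under those assumptions; the names training_df and device_os_index and the maxIter value are illustrative, and spark is assumed to be an existing SparkSession:

    from pyspark.ml import Pipeline
    from pyspark.ml.classification import GBTClassifier
    from pyspark.ml.feature import StringIndexer, VectorAssembler

    # If training_rdd is really an RDD of Rows, convert it to a DataFrame first;
    # if your select(...) already returned a DataFrame, use that directly.
    training_df = spark.createDataFrame(training_rdd)

    # VectorAssembler cannot consume string columns, so index device_os first.
    os_indexer = StringIndexer(inputCol="device_os", outputCol="device_os_index")

    # Assemble the indexed categorical plus the numeric columns into one vector.
    assembler = VectorAssembler(
        inputCols=["device_os_index",
                   "30day_click_count",
                   "30day_impression_count",
                   "30day_click_through_rate"],
        outputCol="features")

    # Gradient-boosted tree classifier reading the assembled "features" column.
    gbt = GBTClassifier(labelCol="label", featuresCol="features", maxIter=10)

    pipeline = Pipeline(stages=[os_indexer, assembler, gbt])
    model = pipeline.fit(training_df)
    predictions = model.transform(training_df)  # adds a "prediction" column

    Also note that Spark's GBTClassifier only supports binary classification and requires a numeric label, so if did_click is boolean you will want to cast it to a double (0.0/1.0) before fitting.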