apache-spark, pyspark, apache-spark-ml

How to train a SparkML gradient boosting classifier given an RDD


Given the following RDD:

from pyspark.sql.functions import col

training_rdd = rdd.select(
    # Categorical features
    col('device_os'),  # 'ios', 'android'

    # Numeric features
    col('30day_click_count'),
    col('30day_impression_count'),
    (col('30day_click_count') / col('30day_impression_count')).alias('30day_click_through_rate'),

    # label
    col('did_click').alias('label')
)

I am confused about the syntax for training a gradient boosting classifier.

I am following this tutorial: https://spark.apache.org/docs/latest/ml-classification-regression.html#gradient-boosted-tree-classifier

However, I am unsure how to get my 4 feature columns into a single vector, because VectorIndexer assumes that all the features are already in one column.


Solution

  • You can use VectorAssembler to generate the feature vector. Note that you will have to convert your RDD to a DataFrame first (see the end-to-end sketch at the end of this answer).

    from pyspark.ml.feature import VectorAssembler
    vectorizer = VectorAssembler()
    
    vectorizer.setInputCols(["device_os",
                             "30day_click_count",
                             "30day_impression_count",
                             "30day_click_through_rate"])
    
    vectorizer.setOutputCol("features")
    

    Consequently, you will need to put the vectorizer as the first stage in the Pipeline:

    pipeline = Pipeline(stages=[vectorizer, ...])
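
    Two caveats worth noting: VectorAssembler only accepts numeric, boolean, and vector columns, so a string categorical such as device_os has to be indexed into a numeric column first (for example with StringIndexer), and GBTClassifier expects a numeric label column. The sketch below is a minimal end-to-end example under those assumptions; the names training_df and device_os_index and the maxIter value are illustrative, and spark is assumed to be an existing SparkSession:

    from pyspark.ml import Pipeline
    from pyspark.ml.classification import GBTClassifier
    from pyspark.ml.feature import StringIndexer, VectorAssembler

    # If training_rdd is really an RDD of Rows, convert it to a DataFrame first;
    # if your select(...) already returned a DataFrame, use that directly.
    training_df = spark.createDataFrame(training_rdd)

    # VectorAssembler cannot consume string columns, so index device_os first.
    os_indexer = StringIndexer(inputCol="device_os", outputCol="device_os_index")

    # Assemble the indexed categorical plus the numeric columns into one vector.
    assembler = VectorAssembler(
        inputCols=["device_os_index",
                   "30day_click_count",
                   "30day_impression_count",
                   "30day_click_through_rate"],
        outputCol="features")

    # Gradient-boosted tree classifier reading the assembled "features" column.
    gbt = GBTClassifier(labelCol="label", featuresCol="features", maxIter=10)

    pipeline = Pipeline(stages=[os_indexer, assembler, gbt])
    model = pipeline.fit(training_df)
    predictions = model.transform(training_df)  # adds a "prediction" column

    Also note that Spark's GBTClassifier only supports binary classification and requires a numeric label, so if did_click is boolean you will want to cast it to a double (0.0/1.0) before fitting.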