apache-spark, pyspark, apache-spark-ml

How to set parameters for a custom PySpark Transformer once it's a stage in a fitted ML Pipeline?


I've written a custom ML Pipeline Estimator and Transformer for my own Python algorithm by following the pattern shown here.

However, in that example all the parameters needed by _transform() were conveniently passed into the Model/Transformer by the estimator's _fit() method. My transformer, on the other hand, has several parameters that control how the transform is applied. These parameters are specific to the transformer, so it would feel odd to pass them to the estimator in advance alongside the estimator-specific parameters used to fit the model.
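
For concreteness, here is a stripped-down sketch of that pattern (the MeanSubtracter/MeanSubtracterModel names and the mean-centering logic are just made up for illustration): everything the model needs is computed in _fit() and handed straight to the model's constructor.

    from pyspark.ml import Estimator, Model
    from pyspark.ml.param.shared import HasInputCol, HasOutputCol
    import pyspark.sql.functions as F

    class MeanSubtracter(Estimator, HasInputCol, HasOutputCol):
        def _fit(self, dataset):
            # Learn the column mean from the training data ...
            mean_value = dataset.agg(F.avg(self.getInputCol())).first()[0]
            # ... and pass it, together with the column names, into the model
            return MeanSubtracterModel(mean_value,
                                       self.getInputCol(),
                                       self.getOutputCol())

    class MeanSubtracterModel(Model, HasInputCol, HasOutputCol):
        def __init__(self, meanValue, inputCol, outputCol):
            super(MeanSubtracterModel, self).__init__()
            self.meanValue = meanValue
            self._set(inputCol=inputCol, outputCol=outputCol)

        def _transform(self, dataset):
            return dataset.withColumn(self.getOutputCol(),
                                      F.col(self.getInputCol()) - self.meanValue)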

I can work around this by adding extra Params to the transformer, and this works fine when I use my estimator and transformer outside of an ML Pipeline. But how can I set these transformer-specific parameters once my estimator object has been added as a stage to a Pipeline? For example, you can call getStages() on a pyspark.ml.pipeline.Pipeline and so get hold of the estimators, but there is no corresponding getStages() method on PipelineModel, and I can't see any methods for setting parameters on the PipelineModel stages either.
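
To show the workaround concretely, the hypothetical model class from the sketch above could declare its own transform-time Param with a setter and getter (again, scaleFactor is just an invented example). This works fine as long as I can reach the model object to call the setter:

    from pyspark.ml import Model
    from pyspark.ml.param import Param, Params, TypeConverters
    from pyspark.ml.param.shared import HasInputCol, HasOutputCol
    import pyspark.sql.functions as F

    class MeanSubtracterModel(Model, HasInputCol, HasOutputCol):
        # Transformer-specific parameter: it only affects _transform(), not fitting
        scaleFactor = Param(Params._dummy(), "scaleFactor",
                            "multiplier applied to the centered values",
                            typeConverter=TypeConverters.toFloat)

        def __init__(self, meanValue=None, inputCol=None, outputCol=None):
            super(MeanSubtracterModel, self).__init__()
            self.meanValue = meanValue
            self._set(inputCol=inputCol, outputCol=outputCol)
            self._setDefault(scaleFactor=1.0)

        def setScaleFactor(self, value):
            return self._set(scaleFactor=value)

        def getScaleFactor(self):
            return self.getOrDefault(self.scaleFactor)

        def _transform(self, dataset):
            centered = F.col(self.getInputCol()) - self.meanValue
            return dataset.withColumn(self.getOutputCol(),
                                      centered * self.getScaleFactor())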

So how can I set the parameters on my transformer before I call transform() on the fitted pipeline model? I'm on Spark 2.2.0.


Solution

  • There is no getStages() method on PipelineModel, but the class does have an undocumented stages attribute.

    For example, if you've just fitted a pipeline model with 3 stages and you want to set some parameters on the second stage, you can just do something like:

    # Grab the fitted second stage and set its transformer-specific parameter:
    myModel = myPipelineModel.stages[1]
    myModel.setMyParam(42)
    # Or in one line:
    # myPipelineModel.stages[1].setMyParam(42)
    
    # Now we can push our data through the fully configured pipeline model:
    resultsDF = myPipelineModel.transform(inputDF)
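
    Putting it together with the hypothetical MeanSubtracter example from the question (trainDF and inputDF are assumed to be existing DataFrames, and "x" is just a placeholder column name), the end-to-end flow would look roughly like this:

    from pyspark.ml import Pipeline

    estimator = MeanSubtracter().setInputCol("x").setOutputCol("x_centered")
    pipeline = Pipeline(stages=[estimator])
    pipelineModel = pipeline.fit(trainDF)

    # The fitted stage is a MeanSubtracterModel, so its transform-time
    # parameter can be configured after fitting:
    pipelineModel.stages[0].setScaleFactor(2.0)

    resultsDF = pipelineModel.transform(inputDF)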