
pyspark: access parameters of a saved pipeline model


I have a pyspark pipeline model saved on HDFS like:

from pyspark.ml import Pipeline

stages = feature_stage_list \
            + label_stage_list \
            + assembler_stage_list \
            + classifier_list \
            + label_converter_stage_list
pipeline = Pipeline(stages=stages)
pipeline_model = pipeline.fit(train_df)  # fit on the training data (train_df assumed)
pipeline_model.save(path)

where the classifier is a LogisticRegression model. I would like to load the same model in a separate Spark job and access its parameters. The following is used to load the model:

pipeline_model = PipelineModel.load(path)

Now, when I try to get the model's stages or parameters, nothing is returned. I have tried the following so far:

pipeline_model.params
pipeline_model.explainParams()

printing the type of loaded object:

<class 'pyspark.ml.pipeline.PipelineModel'>

I'm surprised that these return empty results, because we also save the same model using MLflow, and when I look into the model artifacts there I can see both the stages of the model and the parameters. Am I missing something here?
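
For reference, this is roughly what the MLflow side looks like; a minimal sketch assuming the mlflow.spark flavor and an active run (the run id and artifact path below are illustrative):

import mlflow.spark

# log the fitted PipelineModel as an MLflow artifact
mlflow.spark.log_model(pipeline_model, artifact_path="model")

# later, reload it; the result is again a PipelineModel
loaded_model = mlflow.spark.load_model("runs:/<run_id>/model")
print(loaded_model.stages)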


Solution

  • For those who might want to do the same, I ended up doing the following to get the parameters:

    pipeline_model.stages[-2].getMaxIter()
    pipeline_model.stages[-2].getElasticNetParam()
    pipeline_model.stages[-2].getFamily()
    ...
    

    Note that, according to the stages shown in the original post, classifier_list is second to last, which is why the classifier is accessed with pipeline_model.stages[-2]; adjust the index to match the order of stages in your own pipeline (a position-independent lookup is sketched below).
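
    If the order of stages might change, one alternative (a minimal sketch, not part of the original answer; it assumes the classifier stage is a fitted LogisticRegressionModel) is to look the stage up by type instead of by position:

    from pyspark.ml.classification import LogisticRegressionModel
    from pyspark.ml.pipeline import PipelineModel

    pipeline_model = PipelineModel.load(path)

    # pick the fitted LogisticRegressionModel regardless of its position
    lr_model = next(s for s in pipeline_model.stages
                    if isinstance(s, LogisticRegressionModel))

    print(lr_model.getMaxIter())
    print(lr_model.getElasticNetParam())
    print(lr_model.explainParams())  # all params with their current values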