I have a cluster with Spark 2.1 and a process that, at the end, writes a PipelineModel to file, which contains a StringIndexerModel. I can locally (with Spark 2.3) load the pipeline and inspect the StringIndexerModel. What appears very strange is that the methods and fields differ between the two versions, even though they read the same files. In particular, with Spark 2.1 the field inputCol appears not to be there, even though it's obviously needed to make the StringIndexer work.
This is what I get.
Spark 2.1:
pip1 = PipelineModel.load("somepath")
si = pip1.stages[0]
si
#StringIndexer_494eb1f86ababc8540e2
si.inputCol
#Traceback (most recent call last):
# File "<stdin>", line 1, in <module>
#AttributeError: 'StringIndexerModel' object has no attribute 'inputCol'
Spark 2.3:
pip1 = PipelineModel.load("somepath")
si = pip1.stages[0]
si
#StringIndexer_494eb1f86ababc8540e2
si.inputCol
#Param(parent='StringIndexer_494eb1f86ababc8540e2', name='inputCol', doc='input column name')
I understand that methods and fields might change from one version to another, but the inputCol must be somewhere in the object, since it is essential to make fit or transform work. Is there a way to extract the inputCol in Spark 2.1 with PySpark?
The heavy lifting in Spark ML is done by the internal Java objects (_java_obj); that's why the objects can work even though their internals are never fully exposed in the Python API. This of course limits what can be done without drilling into the Java API, and Params are exposed in PySpark models only since Spark 2.3 (SPARK-10931).
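On Spark 2.3 the value can therefore be read from the Python wrapper itself, through the generic Params API; a quick sketch, assuming the si stage loaded in the question:
si.inputCol                   # Param(parent='StringIndexer_...', name='inputCol', ...)
si.getOrDefault(si.inputCol)  # the actual column name set on the model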
In previous versions you can access the internal Java model and fetch the data from there. However, if you want the value of a Param, you should use the corresponding get* method, not the Param itself:
si._java_obj.getInputCol()
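If the pipeline contains several indexers, the same approach can be applied to each stage. A minimal sketch, assuming Spark 2.1 and the "somepath" location from the question; the loop and printing are only illustrative:
from pyspark.ml import PipelineModel
from pyspark.ml.feature import StringIndexerModel

pip1 = PipelineModel.load("somepath")
for stage in pip1.stages:
    if isinstance(stage, StringIndexerModel):
        # The Python wrapper has no inputCol Param in 2.1, but the
        # underlying Java model still exposes the usual getters.
        print(stage._java_obj.getInputCol(), stage._java_obj.getOutputCol())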