Tags: python, apache-spark, pyspark, apache-spark-mllib, apache-spark-ml

StringIndexerModel inputCol


I have a cluster with Spark 2.1 and a process that, at the end, writes a PipelineModel containing a StringIndexerModel to file. I can load the pipeline locally (with Spark 2.3) and inspect the StringIndexerModel. What appears very strange is that the methods and fields differ between the two versions, even though they read the same files. In particular, with Spark 2.1 the field inputCol appears to be missing, even though it's obviously needed to make the StringIndexer work.

This is what I get.

Spark 2.1:

pip1 = PipelineModel.load("somepath")
si = pip1.stages[0]
si
#StringIndexer_494eb1f86ababc8540e2
si.inputCol
#Traceback (most recent call last):
#  File "<stdin>", line 1, in <module>
#AttributeError: 'StringIndexerModel' object has no attribute 'inputCol'

Spark 2.3:

pip1 = PipelineModel.load("somepath")
si = pip1.stages[0]
si
#StringIndexer_494eb1f86ababc8540e2
si.inputCol
#Param(parent='StringIndexer_494eb1f86ababc8540e2', name='inputCol', doc='input column name')

I understand that methods and fields might change from one version to another, but inputCol must be stored somewhere in the object, since it is essential for fit or transform to work. Is there a way to extract inputCol in Spark 2.1 with PySpark?


Solution

  • The heavy lifting in Spark ML is done by the internal Java objects (_java_obj); that's why the Python objects can work even though their internals are never fully exposed in the Python API. This of course limits what can be done without drilling into the Java API. Since Spark 2.3, Params are exposed in PySpark models (SPARK-10931).
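
    On 2.3 you can therefore read the value behind the Param through the standard Params API. A minimal illustration, using the si object loaded above:

    si.getOrDefault(si.inputCol)
    # equivalently: si.getOrDefault("inputCol")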

    In previous versions you can access the internal Java model and fetch the data from there. However, if you want the value of a Param, you should use the corresponding get* method, not the Param object as such:

    si._java_obj.getInputCol()
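
    If the same code must run on both versions, you can try the Python Param first and fall back to the Java object. A small sketch (get_input_col is a hypothetical helper name; si is the stage loaded above):

    def get_input_col(model):
        try:
            # Spark 2.3+: the Param is exposed on the Python side.
            return model.getOrDefault("inputCol")
        except Exception:
            # Spark 2.1: no Python Param, fall back to the Java object.
            return model._java_obj.getInputCol()

    get_input_col(si)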
    
