I am working with Spark pipelines and often find myself in a situation where I have a bunch of SQLTransformers that do different things in a pipeline, and I can't really tell what each one does without reading its entire statement.
I would like to add some simple documentation or a tag to each transformer (which would be persisted when the transformer is saved and could be retrieved later if need be).
So, basically, something like this:
s = SQLTransformer()
s.tag = "basic target generation"
s.save("tmp")
s2 = SQLTransformer.load("tmp")
print(s2.tag)
or
s = SQLTransformer()
s.setParam(tag="basic target generation")
s.save("tmp")
s2 = SQLTransformer.load("tmp")
print(s2.getParam("tag"))
I can see that I can't do either right now, because the Param objects are locked down: I can't modify the existing ones (other than statement) or add new ones. But is there anything I can do to get some functionality like this?
I am using Spark 2.1.1 with Python.
Not without implementing your own Scala Transformer extending SQLTransformer and then writing a Python interface (or writing a standalone Python Transformer; see How to Roll a Custom Estimator in PySpark mllib).
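For illustration, here is a minimal sketch of what such a standalone Python Transformer could look like, carrying its own tag Param. The class name TaggedSQLTransformer and the getTag method are my own inventions, and the DefaultParamsReadable / DefaultParamsWritable persistence mixins only exist in Python from Spark 2.3 onwards, so on 2.1.1 you would still have to write the save/load plumbing yourself:

from pyspark import keyword_only
from pyspark.ml import Transformer
from pyspark.ml.param import Param, Params
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable

class TaggedSQLTransformer(Transformer, DefaultParamsReadable, DefaultParamsWritable):
    # Two Params: the SQL statement itself and a free-form documentation tag.
    statement = Param(Params._dummy(), "statement",
                      "SQL statement with __THIS__ as the input table")
    tag = Param(Params._dummy(), "tag", "free-form documentation string")

    @keyword_only
    def __init__(self, statement=None, tag=None):
        super(TaggedSQLTransformer, self).__init__()
        kwargs = self._input_kwargs
        self._set(**kwargs)

    def getTag(self):
        return self.getOrDefault(self.tag)

    def _transform(self, dataset):
        # Mimic SQLTransformer: expose the input as __THIS__, then run the statement.
        dataset.createOrReplaceTempView("__THIS__")
        return dataset.sql_ctx.sql(self.getOrDefault(self.statement))

Because tag is a regular Param, it survives a save/load round trip:

t = TaggedSQLTransformer(statement="SELECT * FROM __THIS__",
                         tag="basic target generation")
t.save("tmp")
print(TaggedSQLTransformer.load("tmp").getTag())
## basic target generation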
However, if you would like to add some simple documentation, you can just add comments to the statement:
s = SQLTransformer(statement="""
-- This is a transformer that selects everything
SELECT * FROM __THIS__""")
print(s.getStatement())
## -- This is a transformer that selects everything
## SELECT * FROM __THIS__
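If you go this route, you can recover the "tag" later by parsing the leading comments back out of the statement. A hypothetical helper (the name get_tag is mine) might look like this; since statement is exactly the Param that SQLTransformer does persist, the comments survive a save/load round trip:

def get_tag(transformer):
    # Collect the leading "--" comment lines of the statement as the tag.
    comments = []
    for line in transformer.getStatement().splitlines():
        stripped = line.strip()
        if stripped.startswith("--"):
            comments.append(stripped[2:].strip())
        elif stripped:
            break  # stop at the first non-comment line
    return " ".join(comments)

print(get_tag(s))
## This is a transformer that selects everything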