I am working with Spark pipelines and often find myself in a situation where I have a bunch of SQLTransformers that do different things in a pipeline, and I can't really tell what each one does without reading its entire statement.
I would like to add some simple documentation or a tag to each transformer (which would be persisted when the transformer is saved and could be retrieved later if need be).
So, basically, something like this:
s = SQLTransformer()
s.tag = "basic target generation"
s.save("tmp")
s2 = SQLTransformer.load("tmp")
print(s2.tag)
or
s = SQLTransformer()
s.setParam(tag="basic target generation")
s.save("tmp")
s2 = SQLTransformer.load("tmp")
print(s2.getParam("tag"))
I can see that I can't do either right now, because the Param objects are locked down: I can't modify the existing ones (other than statement) or add new ones. But is there anything I can do to get some functionality like this?
I am using Spark 2.1.1 with Python.
Not without implementing your own Scala Transformer extending SQLTransformer and then writing a Python interface (or writing a standalone Python Transformer; see How to Roll a Custom Estimator in PySpark mllib).
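For illustration, here is a minimal sketch of what such a standalone Python Transformer could look like, carrying its own tag Param. The class name TaggedSQLTransformer and the getTag method are my own inventions, and the DefaultParamsReadable / DefaultParamsWritable persistence mixins only exist in Python from Spark 2.3 onwards, so on 2.1.1 you would still have to write the save/load plumbing yourself:

from pyspark import keyword_only
from pyspark.ml import Transformer
from pyspark.ml.param import Param, Params
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable

class TaggedSQLTransformer(Transformer, DefaultParamsReadable, DefaultParamsWritable):
    # Two Params: the SQL statement itself and a free-form documentation tag.
    statement = Param(Params._dummy(), "statement",
                      "SQL statement with __THIS__ as the input table")
    tag = Param(Params._dummy(), "tag", "free-form documentation string")

    @keyword_only
    def __init__(self, statement=None, tag=None):
        super(TaggedSQLTransformer, self).__init__()
        kwargs = self._input_kwargs
        self._set(**kwargs)

    def getTag(self):
        return self.getOrDefault(self.tag)

    def _transform(self, dataset):
        # Mimic SQLTransformer: expose the input as __THIS__, then run the statement.
        dataset.createOrReplaceTempView("__THIS__")
        return dataset.sql_ctx.sql(self.getOrDefault(self.statement))

Because tag is a regular Param, it survives a save/load round trip:

t = TaggedSQLTransformer(statement="SELECT * FROM __THIS__",
                         tag="basic target generation")
t.save("tmp")
print(TaggedSQLTransformer.load("tmp").getTag())
## basic target generation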
However, if you would like to add some simple documentation, you can just add comments to the statement:
s = SQLTransformer(statement="""
-- This is a transformer that selects everything
SELECT * FROM __THIS__""")
print(s.getStatement())
## -- This is a transformer that selects everything
## SELECT * FROM __THIS__
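If you go this route, you can recover the "tag" later by parsing the leading comments back out of the statement. A hypothetical helper (the name get_tag is mine) might look like this; since statement is exactly the Param that SQLTransformer does persist, the comments survive a save/load round trip:

def get_tag(transformer):
    # Collect the leading "--" comment lines of the statement as the tag.
    comments = []
    for line in transformer.getStatement().splitlines():
        stripped = line.strip()
        if stripped.startswith("--"):
            comments.append(stripped[2:].strip())
        elif stripped:
            break  # stop at the first non-comment line
    return " ".join(comments)

print(get_tag(s))
## This is a transformer that selects everything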