
How does a Spark-dependency-free model export work?


Could anyone please explain in simple terms how a Spark model export works that is NOT dependent on the Spark cluster during predictions?

I mean, if we use Spark functions like ml.feature.StopWordsRemover during training in an ML pipeline and export the pipeline in, say, PMML format, how does this function get regenerated when deployed in production, where I don't have a Spark installation? Maybe when we use JPMML? I went through the PMML wiki page, but it simply explains the structure of PMML; no functional description is provided there.

Any good links to articles are welcome.


Solution

  • Please experiment with the JPMML-SparkML library (or its PySpark2PMML or Sparklyr2PMML frontends) to see exactly how different Apache Spark transformers and models are mapped to the PMML standard.

    For example, the PMML standard does not provide a specialized "remove stopwords" element. Instead, all low-level text manipulation is handled using generic TextIndex and TextIndexNormalization elements. The removal of stopwords is expressed/implemented as a regex transformation in which they are simply replaced with empty strings. To evaluate such PMML documents, your runtime only needs to provide basic regex capabilities - there is absolutely no need for the Apache Spark runtime or its transformer and model algorithms/classes.

    The translation from Apache Spark ML to PMML works surprisingly well (e.g. much better coverage than other translation approaches such as MLeap).
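For illustration, here is a rough, hand-written sketch of what such a PMML fragment can look like. The element names (TextIndex, TextIndexNormalization, InlineTable) come from the PMML standard, but the attributes, field names, and stopword list below are simplified assumptions, not actual JPMML-SparkML output:

```xml
<TextIndex textField="text">
  <!-- Stopword removal encoded as a generic regex replacement step -->
  <TextIndexNormalization isCaseSensitive="false" recursive="true">
    <InlineTable>
      <row>
        <!-- assumed convention: "string" is the pattern to match,
             "stem" is the replacement (empty = delete the match) -->
        <string>\b(a|an|the|is)\b</string>
        <stem></stem>
      </row>
    </InlineTable>
  </TextIndexNormalization>
  <FieldRef field="term"/>
</TextIndex>
```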
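To make the "regex only" point concrete, here is a minimal Python sketch of what a PMML consumer effectively has to do at prediction time for such a normalization step. The stopword list and function name are hypothetical examples; the point is that only a regex engine is required, not any Spark classes:

```python
import re

# Assumed example stopword list; in practice it comes from the PMML document.
stopwords = ["a", "an", "the", "is"]

# One alternation pattern; \b word boundaries keep partial words intact,
# so e.g. the "a" inside "Spark" is not touched.
pattern = re.compile(
    r"\b(?:" + "|".join(map(re.escape, stopwords)) + r")\b",
    flags=re.IGNORECASE,
)

def remove_stopwords(text):
    # Replace stopwords with empty strings, then collapse leftover whitespace.
    return re.sub(r"\s+", " ", pattern.sub("", text)).strip()

print(remove_stopwords("The model is exported without a Spark cluster"))
# → "model exported without Spark cluster"
```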