apache-spark user-defined-functions onnx

How to use ONNX models for inference in Spark

I have trained a model for text classification using huggingface/transformers, then I exported it using the built-in ONNX functionality.

Now I'd like to use it for inference on a large corpus (around 100 million sentences). My idea is to put all the texts in a Spark DataFrame, bundle the .onnx model into a Spark UDF, and run inference that way on a Spark cluster.
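A minimal sketch of that idea, under stated assumptions. It uses mapPartitions rather than a row-at-a-time UDF, so that the ONNX session is created once per partition instead of once per row. The function names, the model_path and tokenizer_dir parameters, the "text" column, and the assumption that the model's first output holds the logits are all hypothetical placeholders; the model file and tokenizer are assumed to be readable at the same path on every executor.

```python
from functools import partial
from itertools import islice


def chunked(iterable, size):
    """Yield lists of up to `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def classify_partition(texts, load_predictor, batch_size=64):
    """Yield (text, label) pairs for one partition of texts.

    `load_predictor` is a zero-argument factory returning a callable that
    maps a list of texts to a list of labels. It is called once per
    partition, on the executor.
    """
    predict = load_predictor()
    for batch in chunked(texts, batch_size):
        yield from zip(batch, predict(batch))


def load_onnx_predictor(model_path, tokenizer_dir):
    """Assumed wiring: an onnxruntime session plus a Hugging Face tokenizer."""
    # Imported lazily so these libraries are only needed on the executors.
    import onnxruntime as ort
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir)
    session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    input_names = {node.name for node in session.get_inputs()}

    def predict(batch):
        enc = tokenizer(batch, padding=True, truncation=True, return_tensors="np")
        # Feed only the inputs the exported graph declares.
        feeds = {k: v.astype("int64") for k, v in enc.items() if k in input_names}
        logits = session.run(None, feeds)[0]  # assumption: first output = logits
        return [int(i) for i in logits.argmax(axis=-1)]

    return predict


def classify_dataframe(df, model_path, tokenizer_dir, text_col="text"):
    """Return an RDD of (text, label) pairs for a Spark DataFrame column."""
    loader = partial(load_onnx_predictor, model_path, tokenizer_dir)
    return (
        df.rdd.map(lambda row: row[text_col])
        .mapPartitions(lambda texts: classify_partition(texts, loader))
    )
```

Batching inside each partition keeps the padding and session overhead per call small; the batch size of 64 is an arbitrary starting point, not a tuned value.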

Is there a better way of doing this? Am I doing things "the right way"?

Solution

I am not sure whether you are aware of SynapseML, or whether its requirements allow you to use it (per the landing page, as of today: "SynapseML requires Scala 2.12, Spark 3.0+, and Python 3.6+"), but SynapseML supports ONNX inference on Spark. This is probably the cleanest solution for you.
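As a rough sketch of what that looks like, assuming SynapseML is installed on the cluster: its ONNXModel is a Spark ML transformer configured with a feed dict (ONNX input name to DataFrame column) and a fetch dict (new output column to ONNX output name). The helper name build_onnx_transformer, the batch size, and the input/output names in the usage comment are hypothetical; the real names depend on how your model was exported, and the texts must already be tokenized into array columns.

```python
def build_onnx_transformer(model_path, feed, fetch, batch_size=32):
    """Configure SynapseML's ONNXModel as a Spark ML transformer.

    feed:  ONNX input name   -> DataFrame column holding that input
    fetch: new output column -> ONNX output name
    """
    # Imported lazily so this module loads even where SynapseML is absent.
    from synapse.ml.onnx import ONNXModel

    return (
        ONNXModel()
        .setModelLocation(model_path)
        .setFeedDict(feed)
        .setFetchDict(fetch)
        .setMiniBatchSize(batch_size)
    )


# Usage (assumed names; check your exported graph's inputs and outputs):
# onnx_model = build_onnx_transformer(
#     "model.onnx",
#     feed={"input_ids": "input_ids", "attention_mask": "attention_mask"},
#     fetch={"logits": "logits"},
# )
# scored_df = onnx_model.transform(tokenized_df)
```

SynapseML handles mini-batching and model distribution itself, which is why this route needs no hand-written UDF.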

EDIT: MLflow also supports exporting a python_function model as an Apache Spark UDF. With MLflow, you save your model in, say, the ONNX format, log and register it via mlflow.onnx.log_model, and later load it in the mlflow.pyfunc.spark_udf call by its model URI, i.e., models:/<model-name>/<model-version>.
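The MLflow route could be sketched roughly as follows. The helper names, the "model" artifact path, and the "prediction" column are hypothetical, and some parameter names differ between MLflow versions, so treat this as an outline rather than a drop-in implementation. How the input columns must be shaped, and which result_type to pass, depends on your model's inputs and outputs.

```python
def model_uri(name, version):
    """Build a registry URI of the form models:/<model-name>/<model-version>."""
    return f"models:/{name}/{version}"


def register_onnx_model(onnx_path, name):
    """Log an exported .onnx file to MLflow and register it under `name`."""
    # Imported lazily; requires the mlflow and onnx packages.
    import mlflow
    import mlflow.onnx
    import onnx

    with mlflow.start_run():
        mlflow.onnx.log_model(
            onnx.load(onnx_path),
            "model",  # artifact path inside the run (hypothetical name)
            registered_model_name=name,
        )


def score(spark, df, name, version, input_cols, result_type=None):
    """Add a "prediction" column by applying the registered model as a UDF."""
    import mlflow.pyfunc

    kwargs = {} if result_type is None else {"result_type": result_type}
    predict = mlflow.pyfunc.spark_udf(spark, model_uri(name, version), **kwargs)
    return df.withColumn("prediction", predict(*input_cols))
```

Registering the model once and loading it by URI keeps the Spark job free of any hard-coded file paths, at the cost of needing a reachable MLflow tracking server from the cluster.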