I have an Apache Beam pipeline that works flawlessly with a DirectRunner, but not with a DataflowRunner:
When using a DataflowRunner I get an "Error 413 (Request entity too large)".
From what I understand, this is because the pipeline file is too large (I dumped it with the following option: --dataflow_job_file=gs://...).
And this is caused by the model I use:
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer('sentence-transformers/paraphrase-MiniLM-L3-v2')
Has anyone already experienced something similar?
You are correct in the assumption that the pipeline file is too large. The direct runner doesn't have that limitation, but I believe Dataflow limits the job's JSON to something like 20 MB.
I'm guessing you're embedding the model into that JSON. You'll probably be better off loading it from an external source; for example, RunInference in the Python SDK lets you load custom models on the workers instead of shipping them with the job description.
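Here is a minimal sketch of that idea, assuming the workers can reach the Hugging Face hub to download the model. The EmbeddingModelHandler class is illustrative, not part of the Beam API; it just implements the ModelHandler interface so the SentenceTransformer is loaded on each worker rather than at pipeline construction time.

from typing import Iterable, Optional, Sequence

import apache_beam as beam
from apache_beam.ml.inference.base import ModelHandler, PredictionResult, RunInference
from sentence_transformers import SentenceTransformer


class EmbeddingModelHandler(ModelHandler):
    """Illustrative handler: loads the model on the worker, not in the job JSON."""

    def __init__(self, model_name: str = 'sentence-transformers/paraphrase-MiniLM-L3-v2'):
        self._model_name = model_name

    def load_model(self) -> SentenceTransformer:
        # Called on the worker, so the model weights never go into the
        # submitted pipeline description.
        return SentenceTransformer(self._model_name)

    def run_inference(
        self,
        batch: Sequence[str],
        model: SentenceTransformer,
        inference_args: Optional[dict] = None,
    ) -> Iterable[PredictionResult]:
        # Encode a batch of sentences and pair each input with its embedding.
        embeddings = model.encode(list(batch))
        return [PredictionResult(example, embedding)
                for example, embedding in zip(batch, embeddings)]


with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        | beam.Create(['a sentence to embed', 'another sentence'])
        | RunInference(EmbeddingModelHandler())
        | beam.Map(print)
    )

With this approach only the (small) model name string is serialized into the job, which should keep the request well under the size limit.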