I am trying to generate sentence embeddings using the Hugging Face sentence-transformers (SBERT) library. Currently, I am using the all-MiniLM-L6-v2 pre-trained model to generate sentence embeddings with PySpark on an AWS EMR cluster. But it seems that even after using a UDF (for distributing across different instances), the model.encode() function is really slow. Is there a way to optimize this process to quickly get embeddings for a very large dataset in a PySpark environment? The dataset has around 2M rows.
I have tried getting embeddings directly with the model.encode() function, and for distribution across different instances I am using a UDF that broadcasts the model to the executors. Also, increasing the size of the cluster doesn't help much. Any suggestions/links would be appreciated!
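For reference, my current approach looks roughly like this (a minimal sketch; `df` and the `sentence` column are placeholders for my actual data):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, FloatType
from sentence_transformers import SentenceTransformer

spark = SparkSession.builder.getOrCreate()

# Broadcast the model so each executor deserializes it only once
model = SentenceTransformer("all-MiniLM-L6-v2")
bc_model = spark.sparkContext.broadcast(model)

@udf(returnType=ArrayType(FloatType()))
def embed(sentence):
    # Called once per row: encodes a single sentence at a time on CPU
    return bc_model.value.encode(sentence).tolist()

df = df.withColumn("embedding", embed("sentence"))
```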
Even with distributed computing and more CPUs, generating embeddings with sentence transformers is slow. AWS offers p3 EC2 GPU instances that provide GPUs for large parallel computation. Using GPUs and batch processing, I am able to generate sentence-transformer embeddings efficiently. In my case, a single GPU EC2 instance is at least 8 times faster than CPU instances.
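For example, something like the following (a sketch; `sentences` is assumed to be a Python list of strings, and the batch size is illustrative, to be tuned to your GPU memory):

```python
from sentence_transformers import SentenceTransformer

# Load the model directly onto the GPU
model = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")

# encode() batches internally; larger batches keep the GPU busy
embeddings = model.encode(
    sentences,              # list of input strings
    batch_size=256,         # tune to fit your GPU memory
    show_progress_bar=True,
    convert_to_numpy=True,
)
```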
Batch processing is necessary to utilize the GPU efficiently. Otherwise, it's the same as generating an embedding for one sentence at a time.
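If you want to stay in PySpark, an iterator-style pandas UDF (available in Spark 3.0+) lets each executor load the model once and encode whole Arrow batches on its GPU, instead of one row per call. This is a sketch under those assumptions; the batch size and the `sentence` column name are illustrative:

```python
from typing import Iterator

import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import ArrayType, FloatType
from sentence_transformers import SentenceTransformer

@pandas_udf(ArrayType(FloatType()))
def embed_batch(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # Load the model once per executor process, not once per row
    model = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")
    for texts in batches:
        # Encode an entire Arrow batch at once to keep the GPU saturated
        vectors = model.encode(texts.tolist(), batch_size=256)
        yield pd.Series(list(vectors))

df = df.withColumn("embedding", embed_batch("sentence"))
```

With this pattern, repartitioning the DataFrame to match the number of GPU instances helps ensure each GPU gets a full share of the work.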