Tags: python, huggingface-transformers, amazon-sagemaker, endpoint

Deploy AWS SageMaker endpoint for Hugging Face embedding model


I would like to deploy a Hugging Face text-embedding model endpoint via AWS SageMaker.

Here is my code so far:

import sagemaker
from sagemaker.huggingface.model import HuggingFaceModel

# sess = sagemaker.Session()
role = sagemaker.get_execution_role()

# Hub Model configuration. <https://huggingface.co/models>
hub = {
  'HF_MODEL_ID':'sentence-transformers/all-MiniLM-L12-v2', # model_id from hf.co/models
  'HF_TASK':'feature-extraction' # NLP task you want to use for predictions
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    env=hub, # configuration for loading model from Hub
    role=role, # iam role with permissions to create an Endpoint
    py_version='py36',
    transformers_version="4.6", # transformers version used
    pytorch_version="1.7", # pytorch version used
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge"
)

data = {
"inputs": ["This is an example sentence", "Each sentence is converted"]
}

result = predictor.predict(data)
print(len(result[0]))
print(result[0])

While this does deploy an endpoint successfully, it does not behave the way it should. I expect to get a 1x384 list of floats for each string in the input list, but instead I get a 7x384 list for each sentence. Did I maybe use the wrong pipeline?


Solution

  • There are two ways to deploy Hugging Face models as SageMaker endpoints:

    1. The way you have done it: defining env=hub inside the HuggingFaceModel class. This is a nice and quick way to get inferences from the model without any custom preprocessing. You send a request and receive the model's raw output, unmodified.
    2. If you want to do more with each request sent to the model, i.e. preprocess the inputs, change the model's behaviour, and/or postprocess the output, you will need a custom inference script. Sample HuggingFaceModel parameters for that setup are:
    huggingface_model = HuggingFaceModel(
        model_data=s3_location,           # path to your model and script
        role=role,                        # iam role with permissions to create an Endpoint
        transformers_version="4.37.0",    # transformers version used
        pytorch_version="2.1.0",          # pytorch version used
        py_version='py310',               # python version used
        # model_server_workers=1,
        # image_uri="763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-inference:2.1.0-transformers4.37.0-cpu-py310-ubuntu22.04",
        # env=hub
    )
    

    This is the complete reference you need: https://github.com/huggingface/notebooks/blob/main/sagemaker/17_custom_inference_script/sagemaker-notebook.ipynb

    Additional Info: the handler file that runs with each request your endpoint receives: https://github.com/aws/sagemaker-huggingface-inference-toolkit/blob/main/src/sagemaker_huggingface_inference_toolkit/handler_service.py
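
    On the original 7x384 observation: with HF_TASK='feature-extraction' the endpoint returns one 384-dimensional vector per token (the example sentence tokenizes to seven tokens, including special tokens), not one per sentence. The usual fix is mean pooling over the token axis, done either client-side on the response or inside a custom inference script's predict function. A minimal client-side sketch in pure Python, using a mocked response shape rather than a live endpoint (the shapes are taken from the question; the values are placeholders):

    ```python
    # The 'feature-extraction' response is a nested list:
    # [sentence][token][dimension]. Mean pooling averages the token
    # vectors of each sentence into a single sentence embedding.

    def mean_pool(token_embeddings):
        """Average a list of token vectors into one sentence vector."""
        n_tokens = len(token_embeddings)
        dim = len(token_embeddings[0])
        return [sum(tok[d] for tok in token_embeddings) / n_tokens
                for d in range(dim)]

    # Mocked endpoint response: 2 sentences with 7 and 5 tokens,
    # each token a 384-dim vector (placeholder values).
    mock_result = [
        [[0.1] * 384 for _ in range(7)],
        [[0.2] * 384 for _ in range(5)],
    ]

    sentence_embeddings = [mean_pool(tokens) for tokens in mock_result]
    print(len(sentence_embeddings))     # 2 (one embedding per sentence)
    print(len(sentence_embeddings[0]))  # 384 (one vector per sentence)
    ```

    With a real endpoint you would pass predictor.predict(data) through the same pooling; note that sentence-transformers models are normally pooled over non-padding tokens using the attention mask, which a custom inference script can do properly since it has access to the tokenizer output.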