Tags: amazon-web-services, artificial-intelligence, amazon-sagemaker, large-language-model, huggingface

Invocation of a Hugging Face Summarization Model using an AWS Serverless SageMaker Endpoint


I am trying to run an AWS Serverless SageMaker Endpoint with the huggingface-summarization-bert-small2bert-small-finetuned-cnn-daily-mail-summarization model for simple text summarization.

This is my AWS CloudFormation template:

SageMakerModel:
  Type: AWS::SageMaker::Model
  Properties:
    ModelName: SummarizationModel
    Containers:
      - Image: "763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-inference:1.13.1-transformers4.26.0-cpu-py39-ubuntu20.04"
        ModelDataUrl: "s3://jumpstart-cache-prod-us-east-1/huggingface-infer/infer-huggingface-summarization-bert-small2bert-small-finetuned-cnn-daily-mail-summarization.tar.gz"
        Mode: SingleModel
    ExecutionRoleArn: !GetAtt SageMakerExecutionRole.Arn


SageMakerEndpointConfig:
  Type: "AWS::SageMaker::EndpointConfig"
  Properties:
    ProductionVariants:
      - ModelName: !GetAtt SageMakerModel.ModelName
        VariantName: "ServerlessVariant"
        ServerlessConfig: 
          MaxConcurrency: 1
          MemorySizeInMB: 2048

SageMakerEndpoint:
  Type: "AWS::SageMaker::Endpoint"
  Properties:
    EndpointName: SummarizationEndpoint
    EndpointConfigName:
      !GetAtt SageMakerEndpointConfig.EndpointConfigName

The model is deployed successfully as far as I can tell.

I have deployed a Python Lambda function to invoke the endpoint. This is my code:

import json
import boto3

# 'runtime.sagemaker' is the boto3 service name for the SageMaker runtime (InvokeEndpoint) API
client = boto3.client('runtime.sagemaker')
payload = {
  'inputs': 'Summarize this text: This is a beautiful day. I am happy. I am going to the park.'
}

response = client.invoke_endpoint(
        EndpointName="SummarizationEndpoint", 
        ContentType="application/json", 
        Accept="application/json",
        Body=json.dumps(payload)
        # Body=bytes(json.dumps(payload), 'utf-8') # alternative attempt - not working
        # Body=json.dumps(payload).encode("utf-8") # alternative attempt - not working
    ) 

When I run this code I get the following error:

An error occurred: An error occurred (ModelError) when calling the InvokeEndpoint operation: 
Received client error (400) from model with message "
{ 
  "code": 400, 
  "type": "InternalServerException", 
   "message": "\u0027str\u0027 object is not callable"
}".

Since this is a ModelError, I assume the model is deployed and the inference pipeline is being called; I am unsure about the payload format, though. Judging from the test code here, I am guessing that the text to be summarized should be passed in the inputs property of the payload, as is done here. Looking at the SummarizationPipeline, though, I don't quite understand the comments here - should there be a documents property somewhere? I tried all possible combinations of inputs, documents, etc., but without success.

What is the correct way to pass the payload to the model? Can I see a working example?

Update 1: These are the logs from CloudWatch when I use the version payload = {'inputs':'...'}:

[INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.9/site-packages/transformers/pipelines/base.py", line 1084, in __call__
[INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Prediction error
[INFO ] W-9000-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 3
[INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Traceback (most recent call last):
[INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.9/site-packages/sagemaker_huggingface_inference_toolkit/handler_service.py", line 234, in handle
[INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     response = self.transform_fn(self.model, input_data, content_type, accept)
[INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.9/site-packages/sagemaker_huggingface_inference_toolkit/handler_service.py", line 190, in transform_fn
[INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     predictions = self.predict(processed_data, model)
[INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.9/site-packages/sagemaker_huggingface_inference_toolkit/handler_service.py", line 158, in predict
[INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     prediction = model(inputs)
[INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.9/site-packages/transformers/pipelines/text2text_generation.py", line 165, in __call__
[INFO ] W-9000-model ACCESS_LOG - /127.0.0.1:48184 "POST /invocations HTTP/1.1" 400 3416
[INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     result = super().__call__(*args, **kwargs)
[INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.9/site-packages/transformers/pipelines/base.py", line 1084, in __call__
[INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
[INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.9/site-packages/transformers/pipelines/base.py", line 1090, in run_single
[INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     model_inputs = self.preprocess(inputs, **preprocess_params)
[INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.9/site-packages/transformers/pipelines/text2text_generation.py", line 175, in preprocess
[INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     inputs = self._parse_and_tokenize(inputs, truncation=truncation, **kwargs)
[INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.9/site-packages/transformers/pipelines/text2text_generation.py", line 130, in _parse_and_tokenize
[INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     inputs = self.tokenizer(*args, padding=padding, truncation=truncation, return_tensors=self.framework)
[INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - TypeError: 'str' object is not callable

I looked into the code of handler_service.py. Since line 158 is executed, the payload must have successfully passed line 151

        inputs = data.pop("inputs", data)

... which confirms that inputs must be the property name.
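
To double-check that behavior, here is a small standalone sketch of the same dict.pop call (my own illustration, not the actual handler code), showing what the handler sees for a payload with and without the inputs key:

# Re-creating the effect of handler_service.py line 151 on two example payloads.
payload_with_inputs = {"inputs": "Summarize this text: ..."}
payload_without_inputs = {"text": "Summarize this text: ..."}

inputs_1 = payload_with_inputs.pop("inputs", payload_with_inputs)
inputs_2 = payload_without_inputs.pop("inputs", payload_without_inputs)

print(inputs_1)  # the plain string - what the downstream pipeline expects
print(inputs_2)  # the whole dict, passed through unchanged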

However, looking further down the stack trace I couldn't find anything else of interest. The inputs are passed to the tokenizer, and this is where the stack trace ends.

Update 2: I noticed that the same invocation code works with another model. Here is the model YAML that does work:

SageMakerModel2:
  Type: AWS::SageMaker::Model
  Properties:
    ModelName: SummarizationModel2
    Containers:
      - Image: "763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-inference:1.7.1-transformers4.6.1-cpu-py36-ubuntu18.04"
        ModelDataUrl: "s3://jumpstart-cache-prod-us-east-1/huggingface-infer/infer-huggingface-translation-t5-small.tar.gz"
        Mode: SingleModel
    ExecutionRoleArn: !GetAtt SageMakerExecutionRole.Arn

After further analysis I learned that the multi-model-server calls handler_service.initialize() when loading the model, which creates a pipeline using the pipeline() function.

I then downloaded both models and tried to instantiate a pipeline on my machine from both models to see what happens to the tokenizer. Here is the code...

from transformers import pipeline

# p1 model is not working
p1 = pipeline("summarization", "/REDACTED/Code/infer-huggingface-summarization-bert-small2bert-small-finetuned-cnn-daily-mail-summarization")

# p2 model is working
p2 = pipeline("text2text-generation", "/REDACTED/Code/infer-huggingface-translation-t5-small/")
print("Tokenizer for P1: " + str(type(p1.tokenizer)))
print("Tokenizer for P2: " + str(type(p2.tokenizer)))

The output shows that p1.tokenizer is of type NoneType, whereas p2.tokenizer is of class 'transformers.models.t5.tokenization_t5_fast.T5TokenizerFast'.

After further investigating the code of the pipeline() function, I found that in this line ...

load_tokenizer = type(model_config) in TOKENIZER_MAPPING or model_config.tokenizer_class is not None

... load_tokenizer is set to False for p1 because type(model_config) is not found in TOKENIZER_MAPPING, whereas load_tokenizer is True for p2 because it is found (see here and here). I am not sure, though, whether this finding is relevant, as the model_fn() function in the model's inference.py does create a tokenizer using tokenizer = AutoTokenizer.from_pretrained(model_dir) and then passes it to SummarizationPipeline. I tried to create a tokenizer this way locally ...

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/REDACTED/Code/infer-huggingface-summarization-bert-small2bert-small-finetuned-cnn-daily-mail-summarization")

print(str(type(tokenizer)))

... and I do get an instance of type transformers.models.bert.tokenization_bert_fast.BertTokenizerFast.

(I have to admit that I did not fully grasp everything that's going on here, but I continued investigating...)
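
As a quick local sanity check (my own sketch, not code from the deployed package), passing the tokenizer explicitly to pipeline() sidesteps the missing mapping entry:

from transformers import AutoTokenizer, pipeline

model_dir = "/REDACTED/Code/infer-huggingface-summarization-bert-small2bert-small-finetuned-cnn-daily-mail-summarization"

# Build the tokenizer explicitly and hand it to pipeline() instead of letting
# pipeline() try (and fail) to infer it from the EncoderDecoderConfig.
tokenizer = AutoTokenizer.from_pretrained(model_dir)
p1_fixed = pipeline("summarization", model=model_dir, tokenizer=tokenizer)

print(type(p1_fixed.tokenizer))  # a BertTokenizerFast instance instead of NoneType

This is essentially what the missing inference.py does on the endpoint, which leads to the solution below.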


Solution

  • TLDR: The problem was that my model tar.gz file (i.e. s3://jumpstart-cache-prod-us-east-1/huggingface-infer/infer-huggingface-summarization-bert-small2bert-small-finetuned-cnn-daily-mail-summarization.tar.gz) is missing the custom inference.py script, so the sagemaker_huggingface_inference_toolkit used its default load() function to load the model, which failed to create an instance of the tokenizer; the subsequent call to the tokenizer therefore failed with the error 'str' object is not callable.

    Here is what I did to fix the problem:

    • I downloaded the model tar.gz file from S3.
    • I looked up the location of the corresponding sourcedir.tar.gz file with a call to sagemaker.script_uris (see here, and the sketch after this list) and downloaded the sourcedir.tar.gz.
    • I unpacked both the model tar.gz and the sourcedir.tar.gz, and moved the code from sourcedir into a sub-directory named code inside the model directory.
    • I packed the model directory (now including a code subdirectory with inference.py) into a tar.gz and uploaded it to an S3 bucket.
    • I used the S3 URL of this new archive as my ModelDataUrl.
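
    For reference, here is a rough sketch of the lookup and repacking step. The model_id is my best guess derived from the tarball name, and the local paths are placeholders, so treat this as illustrative rather than exact:

    # Look up the sourcedir.tar.gz that belongs to the JumpStart model
    # (model_id below is an assumption based on the tarball name).
    from sagemaker import script_uris

    script_uri = script_uris.retrieve(
        model_id="huggingface-summarization-bert-small2bert-small-finetuned-cnn-daily-mail-summarization",
        model_version="*",
        script_scope="inference",
    )
    print(script_uri)  # S3 URI of the matching sourcedir.tar.gz

    # After downloading and unpacking both archives into a local "model" directory
    # (with the sourcedir contents placed under model/code/), repack it:
    import tarfile

    with tarfile.open("model-with-code.tar.gz", "w:gz") as tar:
        tar.add("model", arcname=".")
    # Upload model-with-code.tar.gz to your own S3 bucket and point ModelDataUrl at it.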

    More details: Here are a few more details I discovered while debugging the issue. Neither of the two model tar.gz files I used contained an inference.py, yet one worked and the other did not. I tried to understand the call stack and why one failed while the other did not. I am sharing it here in case you are a beginner (like me) and want to understand better what is going on.

    1. The docker image specified in the SageMaker Image attribute uses the Python script specified in its ENTRYPOINT to start the multi-model-server. It does so by calling sagemaker_huggingface_inference_toolkit.mms_model_server.start_model_server(). This function passes the handler_service to be used to the command line that starts the MMS server. The MMS server is installed in the docker image (/opt/conda/bin/multi-model-server).
    2. The MMS loads the sagemaker_huggingface_inference_toolkit.handler_service and calls its constructor __init__() and then its initialize() function.
    • The constructor uses sagemaker_inference.Environment() to set environment.module_name to the default value inference.py unless otherwise specified via the SAGEMAKER_PROGRAM parameter. It also adds the code_dir (i.e. the code subdirectory of the model path) to the PYTHON_PATH_ENV - which becomes important later.
    • The initialize() function passes the model-path argument from the command line call to multi-model-server (see above, i.e. "/opt/ml/model") as the model_dir variable in the context.system_properties. Then initialize() calls self.validate_and_initialize_user_module(), which looks up the module_name in the loaded sagemaker_inference environment. If the module_name is specified (e.g. "inference.py") and can be found using importlib.util.find_spec (which works because code_dir was added to PYTHON_PATH_ENV in the constructor), then it is imported, and the functions it declares override the default handler_service.load() module loader, the default handler_service.transform_fn(), and other handler_service functions. See here and more background here.
    3. Assuming the handler_service.load() function is NOT overridden by a model_fn declared in inference.py (which was the case for me; see the model_fn sketch after this list): the default load() function calls transform_utils.get_pipeline(task=task, model_dir=model_dir, device=self.device). If task is already defined in the env var HF_TASK, it is passed to get_pipeline(). Otherwise, task is derived from the config.json file in the model_dir ("/opt/ml/model") by inspecting the architecture property and matching it against a dictionary: e.g., if the architecture ends with EncoderDecoderModel, the task is text2text-generation. In the transform_utils.get_pipeline() function, kwargs["tokenizer"] is set to model_dir because the task is "text2text-generation". Then the pipeline is created with the following line of code (note: pipeline is defined in transformers.pipelines.__init__.py):
    hf_pipeline = pipeline(task=task, model=model_dir, device=device, **kwargs)
    
    
    4. The pipeline() function tries to load the tokenizer by inspecting the model configuration from "/opt/ml/model/config.json". It creates an instance of the right configuration class for the model type by using the model_type property in config.json: e.g. for a t5 model type it creates an instance of T5Config, and for an encoder-decoder model type it creates an instance of EncoderDecoderConfig. It then checks which tokenizer to load by looking up the tokenizer name in the TOKENIZER_MAPPING_NAMES dict, keyed by the model_type from config.json (there is a bit more magic in the lookup, but that is essentially what it does) - see here. Now here is the key point: there is a tokenizer specified for t5, but there is no tokenizer specified for the encoder-decoder model type. Hence load_tokenizer is False for the encoder-decoder model, and tokenizer remains a string (the model_dir path) in pipeline() instead of being replaced by an actual tokenizer instance. This is why my t5-based model infer-huggingface-translation-t5-small successfully created a tokenizer while the encoder-decoder-based model infer-...-mail-summarization did not. To finish the call stack...
    5. pipeline() infers the actual pipeline class from the task; e.g. for a text2text-generation task it returns an instance of Text2TextGenerationPipeline.
    6. When I used the pipeline of the encoder-decoder model type (either by calling the model through the handler_service.predict() handler or the handle() handler), the pipeline called preprocess() and then Text2TextGenerationPipeline._parse_and_tokenize(). Inside _parse_and_tokenize() the tokenizer was still a string (see step 4), so the call self.tokenizer(...) failed because a string cannot be called as a function.
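
    For completeness, here is a minimal sketch of what such a model_fn in code/inference.py looks like. The tokenizer creation and the use of SummarizationPipeline match what I saw in the JumpStart script; the exact model-loading call is my simplification, so treat the details as approximate:

    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, SummarizationPipeline

    def model_fn(model_dir):
        # Build the tokenizer explicitly - this is the step the default load() skips
        # for the encoder-decoder model, leaving the tokenizer as a plain string.
        tokenizer = AutoTokenizer.from_pretrained(model_dir)
        # Loading the model via AutoModelForSeq2SeqLM is my simplification here.
        model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)
        return SummarizationPipeline(model=model, tokenizer=tokenizer)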

    So this is what was going on, and why the lack of code/inference.py in the model's tar.gz file caused the error; I had wrongly assumed it was due to the payload format.