I am trying to run an AWS SageMaker Serverless Endpoint with the huggingface-summarization-bert-small2bert-small-finetuned-cnn-daily-mail-summarization model for simple text summarization.
This is my AWS CloudFormation template:
SageMakerModel:
  Type: AWS::SageMaker::Model
  Properties:
    ModelName: SummarizationModel
    Containers:
      - Image: "763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-inference:1.13.1-transformers4.26.0-cpu-py39-ubuntu20.04"
        ModelDataUrl: "s3://jumpstart-cache-prod-us-east-1/huggingface-infer/infer-huggingface-summarization-bert-small2bert-small-finetuned-cnn-daily-mail-summarization.tar.gz"
        Mode: SingleModel
    ExecutionRoleArn: !GetAtt SageMakerExecutionRole.Arn

SageMakerEndpointConfig:
  Type: "AWS::SageMaker::EndpointConfig"
  Properties:
    ProductionVariants:
      - ModelName: !GetAtt SageMakerModel.ModelName
        VariantName: "ServerlessVariant"
        ServerlessConfig:
          MaxConcurrency: 1
          MemorySizeInMB: 2048

SageMakerEndpoint:
  Type: "AWS::SageMaker::Endpoint"
  Properties:
    EndpointName: SummarizationEndpoint
    EndpointConfigName: !GetAtt SageMakerEndpointConfig.EndpointConfigName
The model is deployed successfully as far as I can tell.
I have deployed a Python lambda function to invoke the endpoint. This is my code:
import json

import boto3

client = boto3.client('runtime.sagemaker')

payload = {
    'inputs': 'Summarize this text: This is a beautiful day. I am happy. I am going to the park.'
}

response = client.invoke_endpoint(
    EndpointName="SummarizationEndpoint",
    ContentType="application/json",
    Accept="application/json",
    Body=json.dumps(payload)
    # Body=bytes(json.dumps(payload), 'utf-8')    # alternative attempt - not working
    # Body=json.dumps(payload).encode("utf-8")    # alternative attempt - not working
)
When I run this code I get the following error:
An error occurred: An error occurred (ModelError) when calling the InvokeEndpoint operation:
Received client error (400) from model with message "
{
"code": 400,
"type": "InternalServerException",
"message": "\u0027str\u0027 object is not callable"
}".
Since this is a ModelError, I am assuming the model is deployed and the inference pipeline is being called; I am unsure about the payload format, though. Judging from the test code here, I am guessing that the text to be summarized should be passed in the inputs property of the payload, like it is being done there. Looking at the SummarizationPipeline, though, I don't quite understand the comments here - should there be a documents property somewhere? I played with all possible combinations of inputs, documents, etc., but without success.
What is the correct way to pass the payload to the model? Can I see a working example?
Update 1: These are the logs from CloudWatch when I use the version payload = {'inputs': '...'}:
[INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/opt/conda/lib/python3.9/site-packages/transformers/pipelines/base.py", line 1084, in __call__
[INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Prediction error
[INFO ] W-9000-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 3
[INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Traceback (most recent call last):
[INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/opt/conda/lib/python3.9/site-packages/sagemaker_huggingface_inference_toolkit/handler_service.py", line 234, in handle
[INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - response = self.transform_fn(self.model, input_data, content_type, accept)
[INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/opt/conda/lib/python3.9/site-packages/sagemaker_huggingface_inference_toolkit/handler_service.py", line 190, in transform_fn
[INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - predictions = self.predict(processed_data, model)
[INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/opt/conda/lib/python3.9/site-packages/sagemaker_huggingface_inference_toolkit/handler_service.py", line 158, in predict
[INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - prediction = model(inputs)
[INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/opt/conda/lib/python3.9/site-packages/transformers/pipelines/text2text_generation.py", line 165, in __call__
[INFO ] W-9000-model ACCESS_LOG - /127.0.0.1:48184 "POST /invocations HTTP/1.1" 400 3416
[INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - result = super().__call__(*args, **kwargs)
[INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/opt/conda/lib/python3.9/site-packages/transformers/pipelines/base.py", line 1084, in __call__
[INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
[INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/opt/conda/lib/python3.9/site-packages/transformers/pipelines/base.py", line 1090, in run_single
[INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - model_inputs = self.preprocess(inputs, **preprocess_params)
[INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/opt/conda/lib/python3.9/site-packages/transformers/pipelines/text2text_generation.py", line 175, in preprocess
[INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - inputs = self._parse_and_tokenize(inputs, truncation=truncation, **kwargs)
[INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/opt/conda/lib/python3.9/site-packages/transformers/pipelines/text2text_generation.py", line 130, in _parse_and_tokenize
[INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - inputs = self.tokenizer(*args, padding=padding, truncation=truncation, return_tensors=self.framework)
[INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - TypeError: 'str' object is not callable
I looked into the code of handler_service.py. Since line 158 is executed, the payload successfully passes line 151
inputs = data.pop("inputs", data)
... which confirms that inputs must be the property name. However, looking further into the stack trace, I couldn't find anything interesting. The inputs are being passed to the tokenizer, and this is where my stack trace ends.
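To make that concrete, here is a tiny sketch of what that single line does with my payload (illustrative only, not the toolkit's full preprocessing):

# Sketch: how the handler extracts the text from the decoded JSON payload.
data = {"inputs": "Summarize this text: This is a beautiful day. I am happy. I am going to the park."}
inputs = data.pop("inputs", data)  # returns the string stored under "inputs", or the whole dict if the key is absent
print(inputs)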
Update 2: I noticed that the same invocation code works with another model. Here's the model yaml that does work:
SageMakerModel2:
  Type: AWS::SageMaker::Model
  Properties:
    ModelName: SummarizationModel2
    Containers:
      - Image: "763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-inference:1.7.1-transformers4.6.1-cpu-py36-ubuntu18.04"
        ModelDataUrl: "s3://jumpstart-cache-prod-us-east-1/huggingface-infer/infer-huggingface-translation-t5-small.tar.gz"
        Mode: SingleModel
    ExecutionRoleArn: !GetAtt SageMakerExecutionRole.Arn
After further analysis I learned that the multi-model-server calls handler_service.initialize when loading the model to create a pipeline using the pipeline() function.
I then downloaded both model archives and tried to instantiate a pipeline from each of them on my machine to see what happens to the tokenizer. Here is the code:

from transformers import pipeline

# p1 model is not working
p1 = pipeline("summarization", "/REDACTED/Code/infer-huggingface-summarization-bert-small2bert-small-finetuned-cnn-daily-mail-summarization")
# p2 model is working
p2 = pipeline("text2text-generation", "/REDACTED/Code/infer-huggingface-translation-t5-small/")

print("Tokenizer for P1: " + str(type(p1.tokenizer)))
print("Tokenizer for P2: " + str(type(p2.tokenizer)))
The code proves that p1.tokenizer is of NoneType, whereas p2.tokenizer is of class transformers.models.t5.tokenization_t5_fast.T5TokenizerFast.
After further investigating the code of the pipeline() function, I found that in this line ...
load_tokenizer = type(model_config) in TOKENIZER_MAPPING or model_config.tokenizer_class is not None
... load_tokenizer is set to False for p1 because type(model_config) is not found in TOKENIZER_MAPPING, whereas load_tokenizer is True for p2 because it was found. (See here and here.)
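Here is a small sketch I used to verify that condition outside of pipeline(); the import location of TOKENIZER_MAPPING and the local paths are assumptions on my side:

from transformers import AutoConfig
from transformers.models.auto.tokenization_auto import TOKENIZER_MAPPING

# Load both local model configs (placeholder paths)
cfg1 = AutoConfig.from_pretrained("/REDACTED/Code/infer-huggingface-summarization-bert-small2bert-small-finetuned-cnn-daily-mail-summarization")
cfg2 = AutoConfig.from_pretrained("/REDACTED/Code/infer-huggingface-translation-t5-small/")

# Same condition as the load_tokenizer line quoted above
for name, cfg in [("p1", cfg1), ("p2", cfg2)]:
    load_tokenizer = type(cfg) in TOKENIZER_MAPPING or getattr(cfg, "tokenizer_class", None) is not None
    print(name, type(cfg).__name__, "load_tokenizer =", load_tokenizer)  # expected: p1 -> False, p2 -> True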
I am not sure, though, whether this finding is relevant, since the model_fn() function in the model's inference.py does create a tokenizer by using tokenizer = AutoTokenizer.from_pretrained(model_dir) and then passes it to SummarizationPipeline. I tried to create a tokenizer this way locally ...

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/REDACTED/Code/infer-huggingface-summarization-bert-small2bert-small-finetuned-cnn-daily-mail-summarization")
print(str(type(tokenizer)))

... and I do get an instance of type transformers.models.bert.tokenization_bert_fast.BertTokenizerFast.
(I have to admit that I did not fully grasp everything that's going on here, but I kept investigating...)
TL;DR:
The problem was that my model tar.gz file (i.e. s3://jumpstart-cache-prod-us-east-1/huggingface-infer/infer-huggingface-summarization-bert-small2bert-small-finetuned-cnn-daily-mail-summarization.tar.gz) was missing the custom inference.py script, so the sagemaker_huggingface_inference_toolkit used its default load() function to load the model. That default path failed to create an instance of the tokenizer and therefore failed when calling the tokenizer, causing the error 'str' object is not callable.
Here is what I did to fix the problem:
1. Located the sourcedir.tar.gz file by using a call to sagemaker.script_uris (see here) and downloaded the sourcedir.tar.gz (a retrieval sketch follows after this list).
2. Extracted the model tar.gz and the sourcedir.tar.gz, and moved the code from sourcedir into a sub-directory named code inside the model directory.
3. Repackaged the model directory (including the code subdirectory with inference.py) into a tar.gz and uploaded it to an S3 bucket.
4. Updated my CloudFormation template to point to the new archive in ModelDataURL.
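For step 1, the lookup can be done with the SageMaker Python SDK roughly as below; the model_id is my best guess at the JumpStart identifier and may need adjusting:

from sagemaker import script_uris

# Retrieve the S3 URI of the inference source bundle (sourcedir.tar.gz) for the JumpStart model.
# The model_id below is an assumption - check the JumpStart model catalog if it differs.
source_uri = script_uris.retrieve(
    region="us-east-1",
    model_id="huggingface-summarization-bert-small2bert-small-finetuned-cnn-daily-mail-summarization",
    model_version="*",
    script_scope="inference",
)
print(source_uri)  # s3://jumpstart-cache-prod-us-east-1/.../sourcedir.tar.gz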
More details:
Here are a few more details I discovered while debugging the issue. Both model tar.gz files I used did not contain an inference.py, but one worked and the other did not. I tried to understand the call stack and why one failed while the other did not. I am sharing it here in case you are a beginner (like me) and want to understand more of what's going on.
1. The container specified in the Image attribute uses the Python script specified in its ENTRYPOINT env var to start the multi-model-server. It does so by calling sagemaker_huggingface_inference_toolkit.mms_model_server.start_model_server(). This function passes the handler_service to be used on the command line that starts the mms server. The mms server is installed in the docker image (/opt/conda/bin/multi-model-server).
2. The mms server imports sagemaker_huggingface_inference_toolkit.handler_service and calls its constructor __init__() function and then its initialize() function.
3. The constructor uses sagemaker_inference.Environment() to set environment.module_name to the default value inference.py unless otherwise specified in the SAGEMAKER_PROGRAM parameter. It also adds the code_dir (i.e. the code subdirectory of the module-path) to the PYTHON_PATH_ENV - which is important later.
4. The initialize() function passes the model-path argument from the command-line call to multi-model-server (see above, i.e. "/opt/ml/model") as the model_dir variable in the context.system_properties. Then initialize() calls self.validate_and_initialize_user_module(), which looks up the module_name in the loaded sagemaker environment. If the module_name is specified (e.g. "inference.py") and can be found using importlib.util.find_spec (which is why code_dir was added to PYTHON_PATH_ENV in the constructor), then it is imported and overrides the default module loader handler_service.load(), the default handler_service.transform_fn(), and the other handler_service functions with the ones specified in inference.py. See here and more background here.
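A minimal sketch of that lookup mechanism, under the assumption that the code directory and module name look like the defaults described above (illustrative, not the toolkit's exact code):

import importlib.util
import sys

# The toolkit prepends the model's code/ directory to the Python path; mimic that here.
code_dir = "/opt/ml/model/code"
sys.path.insert(0, code_dir)

# find_spec() returns a ModuleSpec if "inference" is importable from the path, else None.
spec = importlib.util.find_spec("inference")
if spec is not None:
    user_module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(user_module)  # user-provided model_fn/transform_fn etc. are now available
else:
    print("no custom inference.py found - the default handlers stay in place")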
5. If there is no inference.py (which was the case for me), the default load() function calls transform_utils.get_pipeline(task=task, model_dir=model_dir, device=self.device). If task is already defined in the env var HF_TASK, it is passed to get_pipeline(). Otherwise, task is derived from the config.json file in the model_dir ("/opt/ml/model") by inspecting the architecture property and matching it against a dictionary; e.g., if the architecture ends with EncoderDecoderModel, it is a text2text-generation task. In the transform_utils.get_pipeline() function, kwargs["tokenizer"] is set to model_dir because the task is "text2text-generation". Then the pipeline is created with this line of code (note: pipeline is defined in transformers.pipelines.__init__.py):
hf_pipeline = pipeline(task=task, model=model_dir, device=device, **kwargs)
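The architecture-to-task matching works roughly like the sketch below (simplified; the real toolkit dictionary covers more architectures than the single case shown here):

import json

# Derive the HF task from the model's config.json, as the default load path does
# when the HF_TASK env var is not set (simplified sketch).
with open("/opt/ml/model/config.json") as f:
    config = json.load(f)

architecture = config.get("architectures", [""])[0]

if architecture.endswith("EncoderDecoderModel"):
    task = "text2text-generation"   # this is the case for the summarization model above
else:
    task = None                     # the real code matches many more architecture suffixes

print(architecture, "->", task)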
6. pipeline() tries to load the tokenizer by inspecting the model configuration from "/opt/ml/model/config.json". It creates an instance of the right configuration class for the model type by using the model_type property in config.json; e.g. for a t5 model type it creates an instance of T5Config, and for an encoder-decoder model type it creates an instance of EncoderDecoderConfig. It then checks which tokenizer to load by looking up the name of the tokenizer in the TOKENIZER_MAPPING_NAMES dict, using the model_type from config.json (there's a bit more magic in the lookup, but that's essentially what it does) - see here. Now here is the key point: there is a tokenizer specified for t5, but there is no tokenizer specified for the encoder-decoder model type. Hence, load_tokenizer is False for the encoder-decoder model, and tokenizer remains a string (i.e. the model_dir path) in pipeline() instead of being replaced with an actual tokenizer instance. This is why my t5-based model infer-huggingface-translation-t5-small successfully created a tokenizer but the encoder-decoder-based model infer-...-mail-summarization did not.
7. To finish the call stack: pipeline() infers the actual pipeline class from the task; e.g. for a text2text-generation task it returns an instance of Text2TextGenerationPipeline.
8. When the endpoint was invoked (via the handler_service.predict() handler or the handle() handler), the model called preprocess() and then Text2TextGenerationPipeline._parse_and_tokenize(). Inside _parse_and_tokenize(), the tokenizer was still a string (see point 6 above), and the call self.tokenizer(...) failed because a string cannot be called as a function.
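That last step can be reproduced in isolation (purely illustrative):

# The pipeline's tokenizer attribute is still the model directory path (a str),
# so "calling" it raises exactly the error seen in the CloudWatch logs.
tokenizer = "/opt/ml/model"
tokenizer("This is a beautiful day.")  # TypeError: 'str' object is not callable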
So, this is what was going on, and why the missing code/inference.py in the model's tar.gz file caused the error while I wrongly assumed it was due to the payload format.