machine-learning, nvidia, amazon-sagemaker, tritonserver

How to host/invoke multiple models in NVIDIA Triton Server for inference?


Based on the documentation here, https://github.com/aws/amazon-sagemaker-examples/blob/main/inference/nlp/realtime/triton/multi-model/bert_trition-backend/bert_pytorch_trt_backend_MME.ipynb, I have set up a multi-model endpoint using a GPU instance type and the NVIDIA Triton container. In the linked setup, the model is invoked by passing tokens rather than passing text directly to the model. Is it possible to pass text directly to the model, given that the input type is set to the string data type in config.pbtxt (sample below)? I am looking for any examples around this; the raw-text request I have in mind is sketched after the current token-based invocation below.

config.pbtxt

name: "..."
platform: "..."
max_batch_size : 0
input [
  {
    name: "INPUT_0"
    data_type: TYPE_STRING
    ...
  }
]
output [
  {
    name: "OUTPUT_1"
    ....
  }
]

multi-model invocation

import json
import boto3

# SageMaker runtime client for calling the Triton multi-model endpoint
client = boto3.client("sagemaker-runtime")

text_triton = "Triton Inference Server provides a cloud and edge inferencing solution optimized for both CPUs and GPUs."
# tokenize_text is the notebook's helper that returns token ids and attention mask padded to length 128
input_ids, attention_mask = tokenize_text(text_triton)

payload = {
    "inputs": [
        {"name": "token_ids", "shape": [1, 128], "datatype": "INT32", "data": input_ids},
        {"name": "attn_mask", "shape": [1, 128], "datatype": "INT32", "data": attention_mask},
    ]
}

response = client.invoke_endpoint(
    EndpointName=endpoint_name,  # endpoint created earlier in the notebook
    ContentType="application/octet-stream",
    Body=json.dumps(payload),
    TargetModel=f"bert-{i}.tar.gz",  # i selects which packaged model copy to invoke
)
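
What I would like to do instead is skip the client-side tokenization and send the raw text itself. The snippet below is only a sketch of what I imagine the request could look like, assuming the TYPE_STRING input from the config above is exposed as a "BYTES" tensor in the v2 inference protocol; the input name INPUT_0 follows the config, while the shape, endpoint name, and model archive name are placeholders.

import json
import boto3

client = boto3.client("sagemaker-runtime")

# Hypothetical raw-text request: a string tensor is sent with datatype "BYTES"
# and the text itself in "data"; shape and names are placeholders.
text = "Triton Inference Server provides a cloud and edge inferencing solution."
payload = {
    "inputs": [
        {"name": "INPUT_0", "shape": [1], "datatype": "BYTES", "data": [text]},
    ]
}

response = client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/octet-stream",
    Body=json.dumps(payload),
    TargetModel="bert-0.tar.gz",  # placeholder model archive name
)
print(json.loads(response["Body"].read()))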


Solution

  • If you want, you could make use of an ensemble model in Triton, where the first model tokenizes the text and passes the tokens on to the downstream model (a rough sketch of such a preprocessing model follows below).

    Take a look at this article that describes the strategy: https://blog.ml6.eu/triton-ensemble-model-for-deploying-transformers-into-production-c0f727c012e3
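
    For example, the ensemble's first step can be a Python-backend model whose model.py tokenizes the incoming text. The sketch below is untested and makes several assumptions: the tensor names (INPUT_0, token_ids, attn_mask), the bert-base-uncased tokenizer, and the sequence length of 128 are mine, not taken from the question or the linked post.

# model.py for a hypothetical "preprocess" Python-backend model in the ensemble.
# It reads a TYPE_STRING input, runs a Hugging Face tokenizer, and emits the
# INT32 tensors the downstream BERT model expects.
import numpy as np
import triton_python_backend_utils as pb_utils
from transformers import AutoTokenizer


class TritonPythonModel:
    def initialize(self, args):
        # Assumed tokenizer; use whatever matches the deployed model.
        self.tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    def execute(self, requests):
        responses = []
        for request in requests:
            # TYPE_STRING tensors arrive as numpy arrays of byte strings.
            text_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT_0")
            texts = [t.decode("utf-8") for t in text_tensor.as_numpy().reshape(-1)]

            enc = self.tokenizer(
                texts, padding="max_length", truncation=True, max_length=128
            )
            token_ids = np.array(enc["input_ids"], dtype=np.int32)
            attn_mask = np.array(enc["attention_mask"], dtype=np.int32)

            responses.append(
                pb_utils.InferenceResponse(
                    output_tensors=[
                        pb_utils.Tensor("token_ids", token_ids),
                        pb_utils.Tensor("attn_mask", attn_mask),
                    ]
                )
            )
        return responses

    An ensemble config.pbtxt then wires this model's token_ids/attn_mask outputs to the BERT model's inputs via ensemble_scheduling, so the client can keep sending a single string tensor while Triton performs the tokenization server-side.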