
CUDA error: device-side assert triggered on tensor.to(device='cuda')


An ML model running under Triton Inference Server on a GPU instance group starts throwing the following exception after a certain number of successful inferences: CUDA error: device-side assert triggered

With export CUDA_LAUNCH_BLOCKING=1 set, the stack trace points to {key: val.to(device=COMPUTE_DEVICE) for key, val in inputs.items()}:

Traceback (most recent call last):
  File "/opt/triton_models/feature_based_pwsh_classifier/1/script_embeddings.py", line 129, in compute_code_embeddings
    inputs = {key: val.to(device=COMPUTE_DEVICE) for key, val in inputs.items()}
  File "/opt/triton_models/feature_based_pwsh_classifier/1/script_embeddings.py", line 129, in <dictcomp>
    inputs = {key: val.to(device=COMPUTE_DEVICE) for key, val in inputs.items()}
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Here is a simplified form of the problematic code:

max_length = llm.config.max_position_embeddings

# inputs is a dict with keys: [input_ids, attention_mask]
inputs = tokenizer(text, return_tensors='pt', max_length=max_length, truncation=True, padding=True)

# Move the inputs to the CUDA device
inputs = {key: val.to(device=COMPUTE_DEVICE) for key, val in inputs.items()}

with torch.no_grad():
    outputs = llm(**inputs)

Where:

  • COMPUTE_DEVICE is torch.device('cuda')
  • llm and tokenizer are loaded via the transformers library from GraphCodeBERT
  • once the exception occurs, all subsequent inference requests fail with the same error, and the Triton Server needs to be restarted
  • the inputs look valid (see the sketch after this list): dtype: torch.int64, size: (1, xxx), device: cpu, has_NaN: False, has_Inf: False
  • GPU VRAM usage is usually under 20% when the exception occurs
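
For reference, the validity checks above were along these lines (a sketch; describe_tensor is an illustrative helper, not part of the actual model code):

import torch

def describe_tensor(name, t):
    # Print the properties used to sanity-check the tokenizer outputs
    print(f"{name}: dtype: {t.dtype}, size: {tuple(t.shape)}, device: {t.device}, "
          f"has_NaN: {torch.isnan(t.float()).any().item()}, "
          f"has_Inf: {torch.isinf(t.float()).any().item()}")

for key, val in inputs.items():
    describe_tensor(key, val)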

Help and recommendations are appreciated!


Solution

  • The issue was caused by using max_position_embeddings (514) from the GraphCodeBERT config as the tokenizer's max_length:

    max_length = llm.config.max_position_embeddings
    inputs = tokenizer(text, return_tensors='pt', max_length=max_length, truncation=True, padding=True)
    

    while in fact the correct limit is 512, the value standard for BERT-family models. GraphCodeBERT is RoBERTa-based, and RoBERTa offsets position ids by padding_idx + 1 (= 2), so only max_position_embeddings - 2 = 512 positions are actually usable; with max_length=514, sufficiently long inputs produce position ids that fall outside the position-embedding table, which triggers the device-side assert.
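
    A sketch of the fix (assuming a RoBERTa-style checkpoint such as microsoft/graphcodebert-base):

    # Only max_position_embeddings - 2 positions are usable, because RoBERTa
    # position ids start at padding_idx + 1 (= 2)
    max_length = llm.config.max_position_embeddings - 2  # 514 - 2 = 512

    # alternatively, rely on the tokenizer's own limit (typically 512 here):
    # max_length = tokenizer.model_max_length

    inputs = tokenizer(text, return_tensors='pt', max_length=max_length,
                       truncation=True, padding=True)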

    A few notes on the debugging process:

    • Set the environment variable: export CUDA_LAUNCH_BLOCKING=1
    • Increase logging by adding --log-verbose to the tritonserver command: tritonserver --model-repository /opt/triton_models/ --log-verbose=1
    • Look for the first exception, which in my case was:
      ../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [523,0,0], thread: [31,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
      Traceback (most recent call last):
        File "/opt/triton_models/feature_based_pwsh_classifier/1/script_embeddings.py", line 140, in compute_code_embeddings
          outputs = llm(**inputs)
        File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
          return self._call_impl(*args, **kwargs)
        File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
          return forward_call(*args, **kwargs)
        File "/usr/local/lib/python3.10/dist-packages/transformers/models/roberta/modeling_roberta.py", line 828, in forward
          embedding_output = self.embeddings(
        File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
          return self._call_impl(*args, **kwargs)
        File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
          return forward_call(*args, **kwargs)
        File "/usr/local/lib/python3.10/dist-packages/transformers/models/roberta/modeling_roberta.py", line 130, in forward
          position_embeddings = self.position_embeddings(position_ids)
        File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
          return self._call_impl(*args, **kwargs)
      RuntimeError: CUDA error: device-side assert triggered
      Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
      
    • It turns out that once a CUDA error is thrown, the CUDA context enters an indeterminate state, and any subsequent CUDA tensor operation also produces CUDA error: device-side assert triggered, heavily polluting the logs
    • Manually collect the offending text and replicate the failing operations in a local Jupyter notebook on CPU (sketched below), which produced a much clearer error:
      IndexError: index out of range in self
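
      A minimal CPU reproduction along those lines (a sketch; the model name microsoft/graphcodebert-base and the sample text are assumptions):

      import torch
      from transformers import AutoModel, AutoTokenizer

      tokenizer = AutoTokenizer.from_pretrained('microsoft/graphcodebert-base')
      llm = AutoModel.from_pretrained('microsoft/graphcodebert-base')

      # max_length=514 lets tokenized sequences exceed the 512 usable positions;
      # on CPU the embedding lookup fails immediately with a readable IndexError
      # instead of an asynchronous device-side assert
      long_text = 'Write-Host "test"; ' * 600
      inputs = tokenizer(long_text, return_tensors='pt',
                         max_length=llm.config.max_position_embeddings,  # 514
                         truncation=True, padding=True)

      with torch.no_grad():
          outputs = llm(**inputs)  # IndexError: index out of range in self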
      

    Helpful discussion: