Tags: huggingface-transformers, azure-machine-learning-service

NVIDIA driver too old error when loading BART model onto CUDA; works for other models


I'm getting an error loading a HuggingFace model on an AzureML GPU compute. Loading other models works, such as the first one in the example below:

from transformers import AutoModelForCausalLM

device = "cuda"

checkpoint1 = "Salesforce/codegen-350M-mono" 
# this works!!
codegen = AutoModelForCausalLM.from_pretrained(checkpoint1, trust_remote_code=True).to(device)

checkpoint2 = "facebook/bart-large"
# this doesn't; it raises the error below
bart = AutoModelForCausalLM.from_pretrained(checkpoint2, trust_remote_code=True).to(device)

Here's the full error:

RuntimeError: The NVIDIA driver on your system is too old (found version 11040). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver.

I understand that the driver doesn't match the torch version, but fiddling with the drivers on the machine seems like something Azure would have accounted for.
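For what it's worth, the "found version 11040" in the error appears to be the driver's CUDA version encoded as major*1000 + minor*10, i.e. CUDA 11.4, while the nvidia-*-cu12 packages listed below suggest a torch build compiled against CUDA 12.1. A minimal diagnostic sketch to confirm what torch was built against (assuming only that torch is installed):

import torch

# version of the installed torch build (2.1.0 here)
print(torch.__version__)

# CUDA toolkit version this torch build was compiled against
# (the default 2.1.0 wheel reports 12.1)
print(torch.version.cuda)

# whether torch can initialise CUDA at all; note this can still
# succeed even when individual kernels later fail on an old driver
print(torch.cuda.is_available())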

Here are the relevant library versions for reference:

transformers                4.34.0
torch                       2.1.0
nvidia-cublas-cu12          12.1.3.1
nvidia-cuda-cupti-cu12      12.1.105
nvidia-cuda-nvrtc-cu12      12.1.105
nvidia-cuda-runtime-cu12    12.1.105
nvidia-cudnn-cu12           8.9.2.26
nvidia-cufft-cu12           11.0.2.54
nvidia-curand-cu12          10.3.2.106
nvidia-cusolver-cu12        11.4.5.107
nvidia-cusparse-cu12        12.1.0.106
nvidia-nccl-cu12            2.18.1
nvidia-nvjitlink-cu12       12.2.140
nvidia-nvtx-cu12            12.1.105

Note also that I install PyTorch when I install transformers, like this: pip install transformers[torch]

I use pip since that's the recommended way.

Is there something about the BART model that requires a different GPU/torch configuration compared to other models? Or is this a problem with some AzureML compute configurations?


Solution

  • Installing PyTorch through the transformers extras is probably not the best way to get a torch build compatible with your environment. Based on the CUDA driver in your base image, try installing torch the recommended way from https://pytorch.org/ (see the sketch after the note below). That should be fine with transformers, which only pins a lower bound on torch. Alternatively, you can try a more recent CUDA image from nvcr.io if you have the option to specify it.

    *posting a comment on the question as an answer
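A minimal sketch of what the reinstall might look like, assuming the CUDA 11.4 driver reported in the error and using the CUDA 11.8 wheel index from pytorch.org (CUDA 11.x wheels should run on an 11.4 driver via minor-version compatibility; double-check the install selector on https://pytorch.org/ for your exact setup):

# remove the CUDA 12.1 build pulled in by transformers[torch]
pip uninstall -y torch

# install a torch build compiled against CUDA 11.8 instead
pip install torch --index-url https://download.pytorch.org/whl/cu118

If you go the container route instead, pick an nvcr.io image whose bundled CUDA version matches what the host driver supports.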