Tags: pytorch, cuda, nvidia, huggingface-transformers, large-language-model

RuntimeError: Expected is_sm80 || is_sm90 to be true, but got false


I want to run local fine-tuning of Llama. I followed the Colab notebook from the PyTorch blog post "Finetune LLMs on your own consumer hardware using tools from PyTorch and Hugging Face ecosystem".

I got everything up and running, but during training I get a runtime error:

RuntimeError: Expected is_sm80 || is_sm90 to be true, but got false. (Could this error message be improved? If so, please report an enhancement request to PyTorch.)

As I understand it, PyTorch expects a specific compute capability here, sm_80 or sm_90. The RTX 2080 Ti only has compute capability sm_75, but the NVCC architecture flags of my PyTorch build (see below) do include sm_80 and sm_90.
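
To double-check what PyTorch actually reports for the card, the compute capability can be queried directly; this is a minimal sketch (device index 0 is assumed for a single-GPU machine):

    import torch

    # Query the compute capability of the first CUDA device (index 0 assumed).
    # An RTX 2080 Ti should report (7, 5), i.e. sm_75, below the sm_80/sm_90
    # that the failing kernel checks for.
    major, minor = torch.cuda.get_device_capability(0)
    print(torch.cuda.get_device_name(0), f"sm_{major}{minor}")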

I checked the build configuration of my PyTorch installation (the machine with the RTX 2080 Ti) with print(torch.__config__.show().replace("\n", "\n\t")) and got this:

PyTorch built with:
      - GCC 9.3
      - C++ Version: 201703
      - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
      - Intel(R) MKL-DNN v3.1.1 (Git Hash 64f6bcbcbab628e96f33a62c3e975f8535a7bde4)
      - OpenMP 201511 (a.k.a. OpenMP 4.5)
      - LAPACK is enabled (usually provided by MKL)
      - NNPACK is enabled
      - CPU capability usage: AVX2
      - CUDA Runtime 12.1
      - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
      - CuDNN 8.9.2
      - Magma 2.6.1
      - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-unused-private-field -Wno-aligned-allocation-unavailable -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.1.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 

What can I do to get training working, or is the card simply not capable of it?

The full error report is here:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[19], line 2
      1 ## start training
----> 2 trainer.train()

File ~/thesis/thesis-localllm-codetuning/training22_04/lib/python3.10/site-packages/trl/trainer/sft_trainer.py:323, in SFTTrainer.train(self, *args, **kwargs)
    320 if self.neftune_noise_alpha is not None and not self._trainer_supports_neftune:
    321     self.model = self._trl_activate_neftune(self.model)
--> 323 output = super().train(*args, **kwargs)
    325 # After training we make sure to retrieve back the original forward pass method
    326 # for the embedding layer by removing the forward post hook.
    327 if self.neftune_noise_alpha is not None and not self._trainer_supports_neftune:

File ~/thesis/thesis-localllm-codetuning/training22_04/lib/python3.10/site-packages/transformers/trainer.py:1539, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1537         hf_hub_utils.enable_progress_bars()
   1538 else:
-> 1539     return inner_training_loop(
   1540         args=args,
   1541         resume_from_checkpoint=resume_from_checkpoint,
   1542         trial=trial,
   1543         ignore_keys_for_eval=ignore_keys_for_eval,
   1544     )

File ~/thesis/thesis-localllm-codetuning/training22_04/lib/python3.10/site-packages/transformers/trainer.py:1869, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   1866     self.control = self.callback_handler.on_step_begin(args, self.state, self.control)
   1868 with self.accelerator.accumulate(model):
-> 1869     tr_loss_step = self.training_step(model, inputs)
   1871 if (
   1872     args.logging_nan_inf_filter
   1873     and not is_torch_tpu_available()
   1874     and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step))
   1875 ):
   1876     # if loss is nan or inf simply add the average of previous logged losses
   1877     tr_loss += tr_loss / (1 + self.state.global_step - self._globalstep_last_logged)

File ~/thesis/thesis-localllm-codetuning/training22_04/lib/python3.10/site-packages/transformers/trainer.py:2777, in Trainer.training_step(self, model, inputs)
   2775         scaled_loss.backward()
   2776 else:
-> 2777     self.accelerator.backward(loss)
   2779 return loss.detach() / self.args.gradient_accumulation_steps

File ~/thesis/thesis-localllm-codetuning/training22_04/lib/python3.10/site-packages/accelerate/accelerator.py:1964, in Accelerator.backward(self, loss, **kwargs)
   1962     self.scaler.scale(loss).backward(**kwargs)
   1963 else:
-> 1964     loss.backward(**kwargs)

File ~/thesis/thesis-localllm-codetuning/training22_04/lib/python3.10/site-packages/torch/_tensor.py:492, in Tensor.backward(self, gradient, retain_graph, create_graph, inputs)
    482 if has_torch_function_unary(self):
    483     return handle_torch_function(
    484         Tensor.backward,
    485         (self,),
   (...)
    490         inputs=inputs,
    491     )
--> 492 torch.autograd.backward(
    493     self, gradient, retain_graph, create_graph, inputs=inputs
    494 )

File ~/thesis/thesis-localllm-codetuning/training22_04/lib/python3.10/site-packages/torch/autograd/__init__.py:251, in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    246     retain_graph = create_graph
    248 # The reason we repeat the same comment below is that
    249 # some Python versions print out the first line of a multi-line function
    250 # calls in the traceback and some print out the last line
--> 251 Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    252     tensors,
    253     grad_tensors_,
    254     retain_graph,
    255     create_graph,
    256     inputs,
    257     allow_unreachable=True,
    258     accumulate_grad=True,
    259 )

File ~/thesis/thesis-localllm-codetuning/training22_04/lib/python3.10/site-packages/torch/autograd/function.py:288, in BackwardCFunction.apply(self, *args)
    282     raise RuntimeError(
    283         "Implementing both 'backward' and 'vjp' for a custom "
    284         "Function is not allowed. You should only implement one "
    285         "of them."
    286     )
    287 user_fn = vjp_fn if vjp_fn is not Function.vjp else backward_fn
--> 288 return user_fn(self, *args)

File ~/thesis/thesis-localllm-codetuning/training22_04/lib/python3.10/site-packages/torch/utils/checkpoint.py:288, in CheckpointFunction.backward(ctx, *args)
    283 if len(outputs_with_grad) == 0:
    284     raise RuntimeError(
    285         "none of output has requires_grad=True,"
    286         " this checkpoint() is not necessary"
    287     )
--> 288 torch.autograd.backward(outputs_with_grad, args_with_grad)
    289 grads = tuple(
    290     inp.grad if isinstance(inp, torch.Tensor) else None
    291     for inp in detached_inputs
    292 )
    294 return (None, None) + grads

File ~/thesis/thesis-localllm-codetuning/training22_04/lib/python3.10/site-packages/torch/autograd/__init__.py:251, in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    246     retain_graph = create_graph
    248 # The reason we repeat the same comment below is that
    249 # some Python versions print out the first line of a multi-line function
    250 # calls in the traceback and some print out the last line
--> 251 Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    252     tensors,
    253     grad_tensors_,
    254     retain_graph,
    255     create_graph,
    256     inputs,
    257     allow_unreachable=True,
    258     accumulate_grad=True,
    259 )

RuntimeError: Expected is_sm80 || is_sm90 to be true, but got false.  (Could this error message be improved?  If so, please report an enhancement request to PyTorch.)

I tried upgrading PyTorch to the latest version, restarting the PC, and restarting the Jupyter notebook kernel.

I also installed the latest CUDA version and NVIDIA toolkits.
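
For completeness, a quick way to confirm that the fresh CUDA install is actually picked up by PyTorch (a minimal sketch, nothing specific to my setup):

    import torch

    # Confirm that PyTorch was built against a CUDA runtime and can see the GPU.
    print(torch.version.cuda)         # CUDA runtime this PyTorch build targets
    print(torch.cuda.is_available())  # True if the driver/runtime can use the card
    print(torch.cuda.device_count())  # number of visible GPUs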


Solution

  • Installing a specific version of PyTorch fixed my issue.

    I used this:

    pip install --force-reinstall --pre torch --index-url https://download.pytorch.org/whl/nightly/cu117
    

    As suggested in the GitHub issue comment, installing the nightly build is what fixed it (a quick version check after the reinstall is sketched below).

    Thanks to @palonix for pointing me to the GitHub issue!
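
    After the force-reinstall, a quick sanity check that the nightly build is the one actually being imported (a sketch; the exact version string will differ):

    import torch

    # Nightly wheels carry a ".dev" suffix plus the CUDA tag from the index URL,
    # e.g. something like "2.x.x.devYYYYMMDD+cu117" (exact string will differ).
    print(torch.__version__)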