I am deploying an OCR model, specifically Detectron2, on an AWS EC2 instance, but Detectron2 started throwing a GPU error. When I run nvidia-smi in the EC2 terminal, it shows the error in the title.
Note: I chose the Deep Learning PyTorch AMI, which comes with CUDA already enabled.
I need a GPU to run my model, and since no GPU is detected, it fails. I tried this same setup a month ago to check whether it runs and it worked perfectly, but now all the CUDA-enabled AMIs have suddenly started giving this error.
You are using a DLAMI on a non-GPU instance: t2.large does not have an NVIDIA GPU. Your best bet is to switch to a GPU instance such as g4dn.xlarge (x86) or g5g.xlarge (ARM/Graviton).
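To confirm what you are actually running on, you can query the instance metadata service and check whether the kernel sees an NVIDIA device at all. A quick sanity check (the token step is how IMDSv2 metadata access works):

# Query the instance type via IMDSv2
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-type
# Check whether any NVIDIA PCI device is present (prints nothing on a t2.large)
lspci | grep -i nvidia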
But I would recommend not wasting energy on Marketplace AMIs; instead, start from a base Linux image and install only the necessary drivers on top of it. I wrote a guide on how to create a custom deep learning AMI, but if you only need the driver you can skip the rest of the guide and follow just the GPU driver installation process mentioned here.
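One prerequisite worth flagging before the excerpt below: the NVIDIA .run installer compiles a kernel module, so the instance needs a compiler and headers matching the running kernel. A minimal sketch, assuming a Debian/Ubuntu base image (use the yum/dnf equivalents on Amazon Linux):

# Build tools and kernel headers required by the driver installer
sudo apt-get update
sudo apt-get install -y build-essential "linux-headers-$(uname -r)"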
Specifically, this part:
if [ "$(uname -m)" = "aarch64" ]; then
echo "System is running on ARM / AArch64"
DRIVE_URL="https://us.download.nvidia.com/tesla/535.104.05/NVIDIA-Linux-aarch64-535.104.05.run"
CUDA_SDK_URL="https://developer.download.nvidia.com/compute/cuda/12.2.2/local_installers/cuda_12.2.2_535.104.05_linux_sbsa.run"
CUDNN_ARCHIVE_URL="https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-sbsa/cudnn-linux-sbsa-8.9.5.29_cuda12-archive.tar.xz"
else
DRIVE_URL="https://us.download.nvidia.com/tesla/535.104.05/NVIDIA-Linux-x86_64-535.104.05.run"
CUDA_SDK_URL="https://developer.download.nvidia.com/compute/cuda/12.2.2/local_installers/cuda_12.2.2_535.104.05_linux.run"
CUDNN_ARCHIVE_URL="https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-8.9.5.29_cuda12-archive.tar.xz"
fi
echo "Setting up GPU..."
DRIVER_NAME="NVIDIA-Linux-driver.run"
wget -O "$DRIVER_NAME" "$DRIVE_URL"
TMPDIR=$LOCAL_TMP sh "$DRIVER_NAME" --disable-nouveau --silent
CUDA_SDK="cuda-linux.run"
wget -O "$CUDA_SDK" "$CUDA_SDK_URL"
TMPDIR=$LOCAL_TMP sh "$CUDA_SDK" --silent --override --toolkit --samples --toolkitpath=$USR_LOCAL_PREFIX/cuda-12.2 --samplespath=$CUDA_HOME --no-opengl-libs
CUDNN_ARCHIVE="cudnn-linux.tar.xz"
EXTRACT_PATH="$SRC_DIR/cudnn-extracted"
mkdir -p "$EXTRACT_PATH"
wget -O "$CUDNN_ARCHIVE" "$CUDNN_ARCHIVE_URL"
tar -xJf "$CUDNN_ARCHIVE" -C "$EXTRACT_PATH"
CUDNN_INCLUDE=$(find "$EXTRACT_PATH" -type d -name "include" -print -quit)
CUDNN_LIB=$(find "$EXTRACT_PATH" -type d -name "lib" -print -quit)
cp -P "$CUDNN_INCLUDE"/* $CUDA_HOME/include/
cp -P "$CUDNN_LIB"/* $CUDA_HOME/lib64/
chmod a+r $CUDA_HOME/lib64/*
ldconfig
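Once the script has finished, a quick verification sketch, assuming CUDA_HOME points at the toolkit path installed above (persist the exports in your shell profile if you want them to survive reboots):

# Make the toolkit visible in the current shell
export PATH="$CUDA_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64:$LD_LIBRARY_PATH"
# Driver check: should now list the GPU instead of failing
nvidia-smi
# Toolkit check: should report CUDA 12.2
nvcc --version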
Disclaimer: I am the author of the articles referenced above, where I describe in detail how to set up a GPU in the cloud.