
AWS EC2: NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running


I am deploying an OCR model (specifically Detectron2) on an AWS EC2 instance, but Detectron2 started throwing a GPU error. When I run nvidia-smi in the EC2 terminal, it shows the error in the title.

Note: I have chosen the Deep Learning PyTorch AMI with CUDA already enabled.

I need a GPU to run my model, and since none is detected the model fails. I tried the same setup a month ago to check that it works, and it ran perfectly, but now all the CUDA-enabled AMIs have suddenly started giving this error.


Solution

  • You are using a DLAMI on a non-GPU instance: t2.large has no NVIDIA GPU. Your best bet is a GPU instance type such as g4dn.xlarge (x86) or g5g.xlarge (ARM/Graviton).

    That said, I would recommend not wasting energy on Marketplace AMIs; instead, start from a base Linux image and install only the necessary drivers on top of it. I wrote a guide on how to create a custom deep learning AMI, but if you only need the driver you can skip the guide and follow just the GPU driver installation process mentioned here.

    Specifically, this part:

    # These paths come from the surrounding guide; the defaults below are
    # reasonable assumptions if you run this part standalone.
    LOCAL_TMP="${LOCAL_TMP:-/tmp}"
    USR_LOCAL_PREFIX="${USR_LOCAL_PREFIX:-/usr/local}"
    CUDA_HOME="${CUDA_HOME:-$USR_LOCAL_PREFIX/cuda-12.2}"
    SRC_DIR="${SRC_DIR:-$HOME/src}"

    # Pick the driver, CUDA toolkit, and cuDNN downloads for this architecture.
    if [ "$(uname -m)" = "aarch64" ]; then
        echo "System is running on ARM / AArch64"
        DRIVER_URL="https://us.download.nvidia.com/tesla/535.104.05/NVIDIA-Linux-aarch64-535.104.05.run"
        CUDA_SDK_URL="https://developer.download.nvidia.com/compute/cuda/12.2.2/local_installers/cuda_12.2.2_535.104.05_linux_sbsa.run"
        CUDNN_ARCHIVE_URL="https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-sbsa/cudnn-linux-sbsa-8.9.5.29_cuda12-archive.tar.xz"
    else
        DRIVER_URL="https://us.download.nvidia.com/tesla/535.104.05/NVIDIA-Linux-x86_64-535.104.05.run"
        CUDA_SDK_URL="https://developer.download.nvidia.com/compute/cuda/12.2.2/local_installers/cuda_12.2.2_535.104.05_linux.run"
        CUDNN_ARCHIVE_URL="https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-8.9.5.29_cuda12-archive.tar.xz"
    fi

    echo "Setting up GPU..."
    # Install the NVIDIA driver (disables the conflicting nouveau driver).
    DRIVER_NAME="NVIDIA-Linux-driver.run"
    wget -O "$DRIVER_NAME" "$DRIVER_URL"
    TMPDIR=$LOCAL_TMP sh "$DRIVER_NAME" --disable-nouveau --silent

    # Install the CUDA toolkit and samples.
    CUDA_SDK="cuda-linux.run"
    wget -O "$CUDA_SDK" "$CUDA_SDK_URL"
    TMPDIR=$LOCAL_TMP sh "$CUDA_SDK" --silent --override --toolkit --samples \
        --toolkitpath="$USR_LOCAL_PREFIX/cuda-12.2" --samplespath="$CUDA_HOME" --no-opengl-libs

    # Unpack cuDNN and copy its headers and libraries into the CUDA tree.
    CUDNN_ARCHIVE="cudnn-linux.tar.xz"
    EXTRACT_PATH="$SRC_DIR/cudnn-extracted"
    mkdir -p "$EXTRACT_PATH"
    wget -O "$CUDNN_ARCHIVE" "$CUDNN_ARCHIVE_URL"
    tar -xJf "$CUDNN_ARCHIVE" -C "$EXTRACT_PATH"
    CUDNN_INCLUDE=$(find "$EXTRACT_PATH" -type d -name "include" -print -quit)
    CUDNN_LIB=$(find "$EXTRACT_PATH" -type d -name "lib" -print -quit)
    cp -P "$CUDNN_INCLUDE"/* "$CUDA_HOME"/include/
    cp -P "$CUDNN_LIB"/* "$CUDA_HOME"/lib64/
    chmod a+r "$CUDA_HOME"/lib64/*
    ldconfig
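
    Once the script finishes (and after a reboot, if the driver installer complains about a loaded kernel module), it is worth a quick sanity pass over each component. This is a minimal sketch, assuming the default install paths from the script above; on a machine without the GPU stack it simply reports what is missing:

    ```shell
    #!/bin/sh
    # Post-install sanity pass: report the status of each GPU component.
    report=""
    for tool in nvidia-smi nvcc; do
        if command -v "$tool" >/dev/null 2>&1; then
            report="$report$tool: $("$tool" --version 2>/dev/null | head -n 1)\n"
        else
            report="$report$tool: not found (driver/toolkit install may have failed)\n"
        fi
    done
    # cuDNN: the copied libraries should be registered with the dynamic linker.
    if ldconfig -p 2>/dev/null | grep -qi cudnn; then
        report="${report}cudnn: registered with ldconfig\n"
    else
        report="${report}cudnn: not in the linker cache\n"
    fi
    printf "%b" "$report"
    ```

    If nvidia-smi is present but still reports the error from the title, the driver installed but could not talk to the hardware, which again points back at the instance type.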
    

    Disclaimer: I am the author of the articles mentioned above, where I describe in detail how to set up a GPU in the cloud.
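
As a quick way to confirm the first point above — that the instance actually exposes an NVIDIA GPU before you do any driver work — a minimal check (no AWS-specific tooling assumed; on a t2.large this finds nothing, which is exactly why nvidia-smi fails):

```shell
#!/bin/sh
# Pre-check: does this machine expose an NVIDIA PCI device at all?
if lspci 2>/dev/null | grep -qi nvidia; then
    gpu_status="NVIDIA GPU detected"
else
    gpu_status="no NVIDIA GPU on this instance (pick a GPU instance type instead)"
fi
echo "$gpu_status"
```

If this prints the "no NVIDIA GPU" line, no amount of driver reinstalling will help; switch the instance type first.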