
AWS EC2: NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running


I am deploying an OCR model (specifically Detectron2) on an AWS EC2 instance, but Detectron2 started throwing a GPU error. When I run nvidia-smi in the EC2 terminal, it shows the error in the title.

Note: I have chosen the Deep Learning PyTorch AMI with CUDA already enabled.

I need a GPU to run my model, and since none is detected the model fails. I tried the same setup a month ago to check that it works, and it ran perfectly, but now all the CUDA-enabled AMIs have suddenly started giving this error.


Solution

  • You are using a DLAMI on a non-GPU instance: t2.large has no NVIDIA GPU. Your best bet is a GPU instance type such as g4dn.xlarge (x86) or g5g.xlarge (ARM/Graviton).

    That said, I would recommend not wasting energy on Marketplace AMIs; instead, start from a base Linux image and install only the necessary drivers on top of it. I wrote a guide on how to create a custom deep learning AMI, but if you only need the driver you can skip the guide and follow just the GPU driver installation process mentioned here.

    Specifically, this part:

    # These paths come from the surrounding guide; the defaults below are
    # reasonable assumptions if you run this part standalone.
    LOCAL_TMP="${LOCAL_TMP:-/tmp}"
    USR_LOCAL_PREFIX="${USR_LOCAL_PREFIX:-/usr/local}"
    CUDA_HOME="${CUDA_HOME:-$USR_LOCAL_PREFIX/cuda-12.2}"
    SRC_DIR="${SRC_DIR:-$HOME/src}"

    # Pick the driver, CUDA toolkit, and cuDNN downloads for this architecture.
    if [ "$(uname -m)" = "aarch64" ]; then
        echo "System is running on ARM / AArch64"
        DRIVER_URL="https://us.download.nvidia.com/tesla/535.104.05/NVIDIA-Linux-aarch64-535.104.05.run"
        CUDA_SDK_URL="https://developer.download.nvidia.com/compute/cuda/12.2.2/local_installers/cuda_12.2.2_535.104.05_linux_sbsa.run"
        CUDNN_ARCHIVE_URL="https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-sbsa/cudnn-linux-sbsa-8.9.5.29_cuda12-archive.tar.xz"
    else
        DRIVER_URL="https://us.download.nvidia.com/tesla/535.104.05/NVIDIA-Linux-x86_64-535.104.05.run"
        CUDA_SDK_URL="https://developer.download.nvidia.com/compute/cuda/12.2.2/local_installers/cuda_12.2.2_535.104.05_linux.run"
        CUDNN_ARCHIVE_URL="https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-8.9.5.29_cuda12-archive.tar.xz"
    fi

    echo "Setting up GPU..."
    # Install the NVIDIA driver (disables the conflicting nouveau driver).
    DRIVER_NAME="NVIDIA-Linux-driver.run"
    wget -O "$DRIVER_NAME" "$DRIVER_URL"
    TMPDIR=$LOCAL_TMP sh "$DRIVER_NAME" --disable-nouveau --silent

    # Install the CUDA toolkit and samples.
    CUDA_SDK="cuda-linux.run"
    wget -O "$CUDA_SDK" "$CUDA_SDK_URL"
    TMPDIR=$LOCAL_TMP sh "$CUDA_SDK" --silent --override --toolkit --samples \
        --toolkitpath="$USR_LOCAL_PREFIX/cuda-12.2" --samplespath="$CUDA_HOME" --no-opengl-libs

    # Unpack cuDNN and copy its headers and libraries into the CUDA tree.
    CUDNN_ARCHIVE="cudnn-linux.tar.xz"
    EXTRACT_PATH="$SRC_DIR/cudnn-extracted"
    mkdir -p "$EXTRACT_PATH"
    wget -O "$CUDNN_ARCHIVE" "$CUDNN_ARCHIVE_URL"
    tar -xJf "$CUDNN_ARCHIVE" -C "$EXTRACT_PATH"
    CUDNN_INCLUDE=$(find "$EXTRACT_PATH" -type d -name "include" -print -quit)
    CUDNN_LIB=$(find "$EXTRACT_PATH" -type d -name "lib" -print -quit)
    cp -P "$CUDNN_INCLUDE"/* "$CUDA_HOME"/include/
    cp -P "$CUDNN_LIB"/* "$CUDA_HOME"/lib64/
    chmod a+r "$CUDA_HOME"/lib64/*
    ldconfig
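
    Once the script finishes (and after a reboot, if the driver installer complains about a loaded kernel module), it is worth a quick sanity pass over each component. This is a minimal sketch, assuming the default install paths from the script above; on a machine without the GPU stack it simply reports what is missing:

    ```shell
    #!/bin/sh
    # Post-install sanity pass: report the status of each GPU component.
    report=""
    for tool in nvidia-smi nvcc; do
        if command -v "$tool" >/dev/null 2>&1; then
            report="$report$tool: $("$tool" --version 2>/dev/null | head -n 1)\n"
        else
            report="$report$tool: not found (driver/toolkit install may have failed)\n"
        fi
    done
    # cuDNN: the copied libraries should be registered with the dynamic linker.
    if ldconfig -p 2>/dev/null | grep -qi cudnn; then
        report="${report}cudnn: registered with ldconfig\n"
    else
        report="${report}cudnn: not in the linker cache\n"
    fi
    printf "%b" "$report"
    ```

    If nvidia-smi is present but still reports the error from the title, the driver installed but could not talk to the hardware, which again points back at the instance type.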
    

    Disclaimer: I am the author of the articles mentioned above, where I describe in detail how to set up a GPU in the cloud.
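
As a quick way to confirm the first point above — that the instance actually exposes an NVIDIA GPU before you do any driver work — a minimal check (no AWS-specific tooling assumed; on a t2.large this finds nothing, which is exactly why nvidia-smi fails):

```shell
#!/bin/sh
# Pre-check: does this machine expose an NVIDIA PCI device at all?
if lspci 2>/dev/null | grep -qi nvidia; then
    gpu_status="NVIDIA GPU detected"
else
    gpu_status="no NVIDIA GPU on this instance (pick a GPU instance type instead)"
fi
echo "$gpu_status"
```

If this prints the "no NVIDIA GPU" line, no amount of driver reinstalling will help; switch the instance type first.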