Tags: amazon-ec2, pytorch, gpu, aws-graviton

How to get PyTorch 2.0 on Amazon EC2 G5g instances to detect CUDA


I have been trying to use a G5g EC2 instance with PyTorch 2.0, but I have been struggling to get it working. I want this specific instance type because the Arm processor makes it significantly cheaper, and it is the only Arm instance family with a GPU. Amazon has been bragging about PyTorch 2.0 optimization on Graviton (see here), so I figured there would be an AMI that came preinstalled with all of this, however after talking to AWS support that is not the case.

I have tried using AMIs that come with CUDA 11.4 and PyTorch 1.1 and then upgrading both, but no matter what I do, PyTorch will not install as the CUDA version. I have followed the commands on PyTorch's website to install that specific version, pointing to the CUDA 11.8 wheel after installing CUDA 11.8:

pip3 install torch --force-reinstall --index-url https://download.pytorch.org/whl/nightly/cu118

but PyTorch still installs as the CPU version. I can confirm this when I run:

>>> import torch
>>> torch.cuda.is_available()
False

I verified that the CUDA version was 11.8 by running nvidia-smi. I have also tried starting from a blank-slate AMI and installing CUDA and then PyTorch, but this led to the same result. The only success I have had was on an instance with x86 architecture, but that is not enough for me.
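
(For context, one way to check whether the installed wheel itself is a CPU-only build, independent of the driver version that nvidia-smi reports, is something like the following; on a CPU-only build, torch.version.cuda comes back as None:)

pip3 show torch | grep -i version
python3 - <<'EOF'
import platform
import torch

print("arch:", platform.machine())              # aarch64 on G5g
print("torch:", torch.__version__)              # CPU-only wheels are often tagged +cpu
print("built with CUDA:", torch.version.cuda)   # None means a CPU-only build
print("CUDA available:", torch.cuda.is_available())
EOF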


Solution

  • If you are trying one of the pre-built Deep Learning AMIs for AWS Graviton2, specifically the G5g family, they will lead you nowhere. Trust me, I have been there myself. Just like you, I was adamant about making it work with Graviton2; for me it was the extra cost and their availability on the Spot Market, and you must have your reasons as well.

    Setting up one yourself is not a difficult task if you know exactly what will work. I figured it out the hard way and wrote a detailed guide about this exact issue, and we are running that setup in production. I installed the latest NVIDIA driver, CUDA 12.2, cuDNN, and PyTorch 2. The following snippet from the script installs the GPU driver and toolkit:

    setup_gpu() {
        echo "Setting up GPU..."
        # Install the NVIDIA data-center driver for aarch64
        wget https://us.download.nvidia.com/tesla/535.104.05/NVIDIA-Linux-aarch64-535.104.05.run
        sh NVIDIA-Linux-aarch64-535.104.05.run --disable-nouveau --silent
        # Install the CUDA 12.2 toolkit (sbsa installers target Arm64 servers)
        wget https://developer.download.nvidia.com/compute/cuda/12.2.2/local_installers/cuda_12.2.2_535.104.05_linux_sbsa.run
        sh cuda_12.2.2_535.104.05_linux_sbsa.run --silent --override --toolkit --samples --toolkitpath=/usr/local/cuda-12.2 --samplespath=$CUDA_HOME --no-opengl-libs
        # Install cuDNN by copying its headers and libraries into the CUDA toolkit tree
        wget https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-sbsa/cudnn-linux-sbsa-8.9.4.25_cuda12-archive.tar.xz
        tar -xf cudnn-linux-sbsa-8.9.4.25_cuda12-archive.tar.xz
        cp -P cudnn-linux-sbsa-8.9.4.25_cuda12-archive/include/* $CUDA_HOME/include/
        cp -P cudnn-linux-sbsa-8.9.4.25_cuda12-archive/lib/* $CUDA_HOME/lib64/
        chmod a+r $CUDA_HOME/lib64/*
        ldconfig
        # Clean up installers and archives
        rm -fr cu* NVIDIA*
    }
    
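    The snippet above references $CUDA_HOME without defining it. As a rough sketch (not part of the original script; the path is inferred from the --toolkitpath argument above, so adjust it to your setup), the environment it assumes looks something like this; the complete guide linked below covers the full set of prerequisites:

    # Assumed environment for setup_gpu(); values are illustrative, not from the original script
    export CUDA_HOME=/usr/local/cuda-12.2                          # matches --toolkitpath above
    export PATH=$CUDA_HOME/bin:$PATH                               # so nvcc and the CUDA tools resolve
    export LD_LIBRARY_PATH=$CUDA_HOME/lib64:${LD_LIBRARY_PATH:-}   # CUDA and cuDNN shared libraries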

    For PyTorch, you can use the following:

    # Install PyTorch from source
    install_pytorch() {
        echo "Installing PyTorch..."
        # Build ccache to speed up the (long) PyTorch compilation
        wget https://github.com/ccache/ccache/releases/download/v4.8.3/ccache-4.8.3.tar.xz
        tar -xf ccache-4.8.3.tar.xz
        pushd ccache-4.8.3
        cmake .
        make -j $CPUS
        popd
        # Build dependencies
        dnf install -y numpy
        $USER_EXEC pip3 install typing-extensions
        # Clone and build PyTorch from source
        git clone --recursive https://github.com/pytorch/pytorch.git
        pushd pytorch
        python3 setup.py install
        popd
        ldconfig
        # Runtime Python dependencies
        $USER_EXEC pip3 install sympy filelock fsspec networkx
    }
    
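    A source build only produces a CUDA-enabled PyTorch if the build actually finds the toolkit. As a hedged sketch (these variables are not in the original snippet but are standard PyTorch build options; the 7.5 arch value assumes the NVIDIA T4G GPU on G5g instances), you can pin the build configuration before running the snippet and sanity-check the result afterwards:

    # Assumed build settings, exported before running install_pytorch()
    export USE_CUDA=1                   # force the CUDA build path
    export TORCH_CUDA_ARCH_LIST="7.5"   # assumed: NVIDIA T4G (Turing) on G5g
    export MAX_JOBS=$(nproc)            # parallel compile jobs
    # After the install, confirm the build picked up CUDA
    python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"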

    These snippets need some prerequisites, as well as some custom environment variables, so you should follow the complete guide on How to create a custom deep learning AMI.

    The guide not only explains each step in detail, but also includes a single comprehensive script that does all the dirty work for you.

    Setting up an AMI yourself, instead of using a DLAMI from the marketplace, will save you instance/block storage and reduce setup (spawn) time. Above all, it gives you control over exactly what you need in your environment; I have discussed these issues in detail in Why your deep-learning AMI is holding you back. Do check it out, as I am sure you have encountered some of these issues, if not all of them, while working with DLAMIs.

    Disclaimer: I am the author of both of these articles, and I wrote them to share my experience and help others avoid the troubles I had to go through.