Context: At work, I am compiling a set of packages for Intel64 and ARM64, and bundling them up into Linux packages (.rpm, .deb, .apk). I'm building out a whole pipeline to enable this, which will feed into our Artifactory installation.
We're building out self-hosted, native ARM64 runners for our GitHub Enterprise Server system, to pair alongside our existing self-hosted Intel64 runners. Our runners are built atop Amazon EC2 instances. Both are compute-optimized.
| CPU Arch | Instance type | vCPUs | Memory | `uname -m` |
|---|---|---|---|---|
| amd64 / x86_64 | c5.2xlarge | 8 | 16 GB | x86_64 |
| arm64 / aarch64 | c6g.2xlarge | 8 | 16 GB | aarch64 |
| arm64 / aarch64 | c7g.2xlarge | 8 | 16 GB | aarch64 |
Hosts are EKS clusters running EKS-optimized Amazon Linux 2.
Our "runners" are K8S/EKS pods (~Docker containers) that die/re-spawn after each individual workflow. The Docker image is a multi-platform image — same software, same configuration, multiple CPUs. The container OS is Ubuntu "Focal Fossa" 20.04 LTS.
Using our self-hosted GitHub Actions runners, I wrote a GHA workflow that downloads the source of a Rust project and compiles it: once on the Intel runner, and once on the ARM64 runner. As the first step of the workflow, I run `uname -m` and output the result, and I see what I'm expecting. I also run `file` against the compiled binary, and again I see what I'm expecting.

(I'm working very hard to have an apples-to-apples comparison here.)
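For reference, that sanity check amounts to a couple of commands (a minimal sketch; the binary path assumes the lychee build script below):

```shell
# First step of the workflow: confirm which architecture this runner reports.
uname -m    # x86_64 on the Intel runner, aarch64 on the Graviton runners

# After the build, confirm the binary's architecture matches the runner:
#   file target/release/lychee
#   (expect "x86-64" or "ARM aarch64" in the output, depending on the runner)
```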
I'm building https://github.com/lycheeverse/lychee as a test project for the pipeline. I've not (yet) tested other compilations, but this felt complex enough to put the new ARM64 runners through their paces.
Here is the build script (`${ARCH}` is either `x86_64` or `aarch64`, as appropriate):
```shell
sudo apt-get -y update
sudo apt-get -y install --no-install-recommends \
    build-essential \
    ca-certificates \
    curl \
    file \
    git \
    gpg \
    gpg-agent \
    gzip \
    libssl-dev \
    openssh-client \
    pkg-config \
    software-properties-common \
    tar \
    wget \
;

# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
source "${HOME}/.cargo/env"

# shellcheck disable=2154
wget --header "Authorization: Bearer ${GITHUB_TOKEN}" \
    "https://github.com/lycheeverse/lychee/archive/refs/tags/v0.15.1.tar.gz"
tar zxvf "v0.15.1.tar.gz"

### Start measuring time
cd "lychee-0.15.1/" || exit 1
cargo fetch --target="${ARCH}-unknown-linux-gnu" --locked
cargo install cargo-auditable --locked

# The `mold` linker is pre-installed.
mold -run cargo auditable build --timings --frozen --release
sudo install -Dm755 target/release/lychee -t "/usr/local/bin/"
### Stop measuring time
```
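For the curious, the "Start/Stop measuring time" markers are where the timing happens. A minimal sketch of that kind of window timer, using bash's built-in `SECONDS` counter (an illustration, not necessarily the exact harness I use):

```shell
#!/usr/bin/env bash
# Measure a window of work with bash's built-in SECONDS counter.
start=${SECONDS}

sleep 1    # stand-in for the cargo fetch / cargo build steps

elapsed=$(( SECONDS - start ))
printf 'Build took %dm %02ds\n' $(( elapsed / 60 )) $(( elapsed % 60 ))
# prints something like: Build took 0m 01s
```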
The Intel runner is one generation older than the `c6g` Graviton runner, and two generations older than the `c7g`. Same vCPUs, same amount of memory available. In the measured section of the script above, I'm seeing these results (average of 5 builds):

- Intel (`c5`): 6m 18s (baseline)
- Graviton2 (`c6g`): 12m 2s (~2x)
- Graviton3 (`c7g`): 9m 36s (~1.5x)

I was expecting parity between the two CPU architectures, or maybe a slight edge for ARM64, seeing that the Graviton instances are a generation or two newer. I also know that languages like Haskell are still working on bringing things to ARM64, and I wonder if the same is true for Rust.
For Rustaceans: are there parts of the Rust build pipeline that are not yet optimized for ARM64 on glibc-based Linuxes?
Next, I'm going to try building a significant project in Go, just to try another language that I know is optimized for ARM64, and attempt to rule out issues with the Graviton processors. I'm also going to set up another representative Rust project to see whether I get different results.
Update (same day): I performed the same test on a Go project (OpenTofu). It has over 300,000 lines of code, and it also depends on several external dependencies that have to be downloaded and compiled.

- Intel (`c5`): 3m 35s (baseline)
- Graviton3 (`c7g`): 2m 18s (~0.64x)

Here, `arm64` was a 36% improvement over the two-generations-older Intel instance. So I don't think my Rust issue is related to Amazon lying about Graviton price-performance. I think it has to do with something about Rust, or about lychee specifically.
To improve the performance of Rust on Graviton, you should enable the Large System Extensions (LSE) via `RUSTFLAGS` before building your project. LSE arrived with the Armv8.1-A architecture and improves overall system throughput. Graviton2 and later, based on Arm's Neoverse CPU line, all include the LSE feature.

Enable LSE by exporting the following before your `cargo build --release` command:

```shell
export RUSTFLAGS="-Ctarget-feature=+lse"
```
So your final script should look like this, after your `sudo apt-get` installs:

```shell
# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
source "${HOME}/.cargo/env"

# shellcheck disable=2154
wget --header "Authorization: Bearer ${GITHUB_TOKEN}" \
    "https://github.com/lycheeverse/lychee/archive/refs/tags/v0.15.1.tar.gz"
tar zxvf "v0.15.1.tar.gz"

### Start measuring time
cd "lychee-0.15.1/" || exit 1
cargo fetch --target="${ARCH}-unknown-linux-gnu" --locked
cargo install cargo-auditable --locked

# Set RUSTFLAGS to enable LSE
export RUSTFLAGS="-Ctarget-feature=+lse"

# The `mold` linker is pre-installed.
mold -run cargo auditable build --timings --frozen --release
sudo install -Dm755 target/release/lychee -t "/usr/local/bin/"
### Stop measuring time
```
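Since a mistyped `RUSTFLAGS` value can silently do nothing, it's worth confirming that LSE atomics actually made it into the binary. One way (a sketch; assumes `binutils` is installed and uses the binary path from the script above; the helper name is mine) is to count LSE mnemonics such as `ldadd`, `swp`, and `cas` in the disassembly:

```shell
# count_lse: count lines containing LSE atomic mnemonics in a disassembly
# stream read from stdin. (Hypothetical helper, not part of the build script.)
count_lse() {
  grep -cE '\b(ldadd|ldclr|ldeor|ldset|swp|cas)[a-z]*\b'
}

# Usage against the built binary (run on the arm64 runner):
#   objdump -d target/release/lychee | count_lse
# A build with +lse should report a substantial count; without it, near zero.
```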