
Hosting Yocto SSTATE_MIRROR over NFS on a busy Build Node - a bad idea?


In a distributed yocto build environment, is it a bad idea to host a global sstate cache (via SSTATE_MIRRORS) on a busy build node over NFS?

I have recently introduced SSTATE_MIRRORS in our yocto build configuration to try to further reduce yocto build times on our "build nodes" (Jenkins agents in vSphere and developer workstations). Per the manual, yocto will search SSTATE_MIRRORS for already-built artifacts if they are not found in the local sstate cache (SSTATE_DIR).
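For context, an SSTATE_MIRRORS entry is a regex plus a mirror URL; ours looks roughly like this in local.conf (the mount point below is a placeholder, not our actual path):

    SSTATE_MIRRORS = "file://.* file:///mnt/global-sstate/PATH"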

All build nodes have a local SSTATE_DIR, in which they cache build results. One of the build nodes (the first Jenkins agent) is designated as the "keeper of the global cache," and exports its local SSTATE_DIR as a read-only NFS share. The other build nodes mount this share and reference it via SSTATE_MIRRORS in their build configurations. I thought I had a really good idea here and patted myself on the back.
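Concretely, the setup is along these lines (the hostname and paths are illustrative, not our real ones):

    # /etc/exports on the "keeper" build node
    /data/yocto/sstate-cache  *(ro,no_subtree_check)

    # on every other build node
    mount -t nfs4 keeper-node:/data/yocto/sstate-cache /mnt/global-sstate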

Alas, I'm seeing a significant increase in build times after making the change.

Certainly I have a lot of troubleshooting and measuring to do before drawing any conclusions. We're using NFS v4, and for sure there's tuning to be done there. I also suspect the build node hosting the NFS share is intermittently very busy performing yocto builds itself (populating its hybrid local/global cache), leaving few CPU cycles for the kernel to service NFS requests.

I'm wondering if others can offer advice based on their experiences implementing shared yocto sstate caches.


Solution

  • It's hard to say exactly what problems you are seeing without some profiling data, but I have a few observations and suggestions.

    You are on the right track using NFS as the sstate cache between CI nodes, but I would recommend taking it one step further. Instead of having one node be the "keeper" of the sstate cache and having all the other nodes use it as a mirror, I would recommend having each node directly mount a common NFS share as SSTATE_DIR. This allows all the nodes to read and write to the cache during their builds, and does a much better job of keeping it up to date with the required sstate objects. If only one node populates the cache, it is unlikely to contain all of the objects needed by the other builds.
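    A minimal sketch of that arrangement, assuming a shared export at sstate-server:/export/sstate (server name and paths are placeholders):

        # /etc/fstab on each CI node
        sstate-server:/export/sstate  /mnt/sstate  nfs4  rw,noatime  0 0

        # local.conf on each CI node: read *and* write the shared cache
        SSTATE_DIR = "/mnt/sstate"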

    Additionally, I would recommend that the NFS server be a persistent server rather than one tied to a Jenkins agent. This gains you a few things:

    1. It means that you can dedicate hardware resources to the cache without having them compete with an ongoing Jenkins build
    2. You can put a simple HTTP server in front of the cache that serves up the sstate files. Doing this allows your developer workstations to set that HTTP server as their SSTATE_MIRRORS (see the sketch after this list), and thus directly benefit from the cache produced by your Jenkins nodes. If you've done this correctly, a developer should be able to replicate a build that was previously built by Jenkins entirely from sstate, which can save a ton of local build time. Even if you aren't exactly replicating a build Jenkins has done before, you can usually still pull a significant amount from sstate.
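    As a rough example, assuming the cache lives at /mnt/sstate and the server is reachable as sstate-server (both placeholders), any static file server will do:

        # serve the cache directory over HTTP (nginx, lighttpd, or even)
        python3 -m http.server 8080 --directory /mnt/sstate

        # local.conf on developer workstations
        SSTATE_MIRRORS = "file://.* http://sstate-server:8080/PATH;downloadfilename=PATH"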

    The final thing to check is whether you have hash equivalence enabled. Hash equivalence is a build acceleration technology that allows bitbake to detect when a metadata change that would normally cause a recipe to rebuild will actually produce the same output as a previously built sstate object, and to restore that object from sstate instead of rebuilding it. This feature is enabled by default starting with Yocto 3.0 (codename zeus). If you do not have a hash equivalence server running in your infrastructure, bitbake will start a local server for the duration of your build. However, this can cause some issues in a distributed environment like your Jenkins nodes, because hash equivalence is highly dependent on the contents of the sstate cache. If each node is running its own hash equivalence server locally, they can end up with diverging sstate hashes (particularly when the CI nodes are transient and the local hash equivalence database is lost), which results in more sstate misses than necessary. The solution is to either run a shared hash equivalence server (bitbake comes with one) and point all your CI nodes at it, or disable hash equivalence completely by setting: BB_SIGNATURE_HANDLER = "OEBasicHash".
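    If you go the shared-server route, the setup is roughly as follows (the host name and port are placeholders; BB_HASHSERVE just needs an address all nodes can reach):

        # on the persistent server: run bitbake's hash equivalence server
        bitbake-hashserv --bind 0.0.0.0:8686

        # local.conf on every CI node (and workstation) pointing at it
        BB_SIGNATURE_HANDLER = "OEEquivHash"
        BB_HASHSERVE = "sstate-server:8686"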