Writing genrule with randomness in Bazel


We have a code generator that takes a random seed as input. If no seed is specified, it picks one at random, which means the output is not deterministic:

# generated_code1.h and generated_code2.h are almost always different
my-code-gen -o generated_code1.h
my-code-gen -o generated_code2.h

On the other hand,

# generated_code3.h and generated_code4.h are always the same
my-code-gen --seed 1234 -o generated_code3.h
my-code-gen --seed 1234 -o generated_code4.h

Our first attempt to create a target for the generated code was:

genrule(
    name = "generated_code",
    srcs = [],
    outs = ["generated_code.h"],
    cmd = "my-code-gen -o $@",  # Note that no seed is specified
)

However, we think this breaks the hermeticity of targets depending on :generated_code. So we ended up implementing a custom rule and using a build_setting (i.e. configuration) to configure the seed for the invocation of my-code-gen.

This makes it possible to specify the seed from the CLI for any target that depends on the generated code, e.g.

bazel build :generated_code --//:code-gen-seed=1234
bazel build :binary --//:code-gen-seed=1234
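
For illustration, here is a minimal sketch of what such a custom rule could look like. It is not our actual implementation: the file name code_gen.bzl, the provider CodeGenSeedInfo, the rule names, and the default seed value are all hypothetical, and it assumes my-code-gen is found on PATH just as in the genrule above.

```starlark
# code_gen.bzl (hypothetical file name)

CodeGenSeedInfo = provider(
    doc = "Seed forwarded to my-code-gen.",
    fields = {"seed": "Seed value, kept as a string."},
)

def _code_gen_seed_impl(ctx):
    # ctx.build_setting_value is the value given on the command line, or
    # build_setting_default from the BUILD file when the flag is absent.
    return [CodeGenSeedInfo(seed = ctx.build_setting_value)]

code_gen_seed = rule(
    implementation = _code_gen_seed_impl,
    build_setting = config.string(flag = True),
)

def _generated_code_impl(ctx):
    out = ctx.actions.declare_file(ctx.label.name + ".h")
    seed = ctx.attr._seed[CodeGenSeedInfo].seed
    ctx.actions.run_shell(
        outputs = [out],
        # "$1" and "$2" are filled in from `arguments` below.
        command = 'my-code-gen --seed "$1" -o "$2"',
        arguments = [seed, out.path],
        # Assumption: my-code-gen is resolved via PATH, as in the genrule.
        use_default_shell_env = True,
        mnemonic = "CodeGen",
    )
    return [DefaultInfo(files = depset([out]))]

generated_code = rule(
    implementation = _generated_code_impl,
    attrs = {
        # Implicit dependency on the build setting that carries the seed.
        "_seed": attr.label(default = Label("//:code-gen-seed")),
    },
)
```

```starlark
# BUILD (at the workspace root, so the flag is addressed as //:code-gen-seed)
load("//:code_gen.bzl", "code_gen_seed", "generated_code")

code_gen_seed(
    name = "code-gen-seed",
    build_setting_default = "1234",  # arbitrary default, chosen for illustration
)

generated_code(name = "generated_code")
```

Because the seed ends up in the action's arguments, it is part of what Bazel hashes for the action, so changing the flag re-runs only this action and whatever depends on its output.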

My questions are:

  1. Consider the genrule definition above: it calls my-code-gen without --seed, which results in non-deterministic output. Does that mean it is non-hermetic? What is the cost of breaking hermeticity (e.g. what trouble would it cause in the future)?
  2. I've found --action_env as an alternative to build_setting, which also allows us to pass a seed value from the CLI to my-code-gen. Compared to build_setting, which is the preferred approach in our case?

Solution

    1. Consider the genrule definition above: it calls my-code-gen without --seed, which results in non-deterministic output. Does that mean it is non-hermetic? What is the cost of breaking hermeticity (e.g. what trouble would it cause in the future)?

    Yes, it's non-hermetic. More precisely, this is non-determinism, which is a symptom of a non-hermetic build: the PRNG isn't seeded with a value that is statically known to the build system. Another common cause of non-determinism is embedding timestamps in build outputs.

    Bazel defines hermeticity as:

    When given the same input source code and product configuration, a hermetic build system always returns the same output by isolating the build from changes to the host system.

    In order to isolate the build, hermetic builds are insensitive to libraries and other software installed on the local or remote host machine. They depend on specific versions of build tools, such as compilers, and dependencies, such as libraries. This makes the build process self-contained as it doesn't rely on services external to the build environment.

    The biggest problem is breaking cacheability of everything that depends on the genrule, because you can no longer trust/guarantee that given a cache key (i.e. hashes of the genrule's inputs, command, environment), the output will be identical and reproducible across build invocations.

    The costs include:

    • basic usability problems ("it works on my machine")
    • build speeds (re-executing commands unnecessarily and wasting compute)
    • cache poisoning (fetching an unexpected output given a cache key)
    • test flakiness (tests implicitly depending on non-deterministic state instead of a fixture)
    • software supply chain security issues (difficulty verifying the provenance and reproducibility of release artifacts).

    2. I've found --action_env as an alternative to build_setting, which also allows us to pass a seed value from the CLI to my-code-gen. Compared to build_setting, which is the preferred approach in our case?

    The //:code-gen-seed build setting only affects targets that depend on it, whereas --action_env affects every action. Changing the build setting therefore invalidates only the minimal set of targets, causing minimal re-analysis, cache lookups, and rebuilds, so it is the preferred approach. You can verify this by comparing incremental build times after changing the seed in a workspace that also contains targets that don't depend on //:code-gen-seed.
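
For comparison, the --action_env variant of the genrule might look roughly like the sketch below. The variable name CODE_GEN_SEED is a hypothetical choice, not something my-code-gen or Bazel defines, and the sketch assumes the genrule's shell sees variables passed with --action_env.

```starlark
# Invoked as, for example:
#   bazel build :generated_code --action_env=CODE_GEN_SEED=1234
genrule(
    name = "generated_code",
    srcs = [],
    outs = ["generated_code.h"],
    # "$$" is Bazel's escape for a literal "$", so the shell (not Bazel)
    # expands the variable. "${VAR:?}" makes the command fail when the
    # variable is unset instead of silently falling back to a random seed.
    cmd = "my-code-gen --seed \"$${CODE_GEN_SEED:?}\" -o $@",
)
```

Here the seed reaches the tool through the environment that --action_env sets for actions in general, which is why changing it invalidates far more than this one target.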