Tags: kubernetes, continuous-integration, docker-in-docker, calico, gitlab-autodevops

GitLab Auto DevOps on Kubernetes hangs, network timeouts, cannot execute yj


When using GitLab Auto DevOps to build and deploy applications from my repository to microk8s, the build jobs often take a long time to run and eventually time out. The issue happens 99% of the time, but some builds run through. The build also tends to stop at a different point in the build script each time.

The projects do not contain a .gitlab-ci.yml file and fully rely on the Auto DevOps feature to do its magic.

For Spring Boot/Java projects, the build often fails while downloading Gradle via the Gradle wrapper; other times it fails while downloading the dependencies themselves. The error message is vague and unhelpful:

Step 5/11 : RUN /bin/herokuish buildpack build
 ---> Running in e9ec110c0dfe
       -----> Gradle app detected
-----> Spring Boot detected
The command '/bin/sh -c /bin/herokuish buildpack build' returned a non-zero code: 35

Sometimes, if you get lucky, the error is different:

Step 5/11 : RUN /bin/herokuish buildpack build
 ---> Running in fe284971a79c
       -----> Gradle app detected
-----> Spring Boot detected
-----> Installing JDK 11... done
-----> Building Gradle app...
-----> executing ./gradlew build -x check
       Downloading https://services.gradle.org/distributions/gradle-7.0-bin.zip
       ..........10%...........20%...........30%..........40%...........50%...........60%...........70%..........80%...........90%...........100%
       To honour the JVM settings for this build a single-use Daemon process will be forked. See https://docs.gradle.org/7.0/userguide/gradle_daemon.html#sec:disabling_the_daemon.
       Daemon will be stopped at the end of the build
       > Task :compileJava
       > Task :compileJava FAILED
       
       FAILURE: Build failed with an exception.
       
       * What went wrong:
       Execution failed for task ':compileJava'.
       > Could not download netty-resolver-dns-native-macos-4.1.65.Final-osx-x86_64.jar (io.netty:netty-resolver-dns-native-macos:4.1.65.Final)
       > Could not get resource 'https://repo.maven.apache.org/maven2/io/netty/netty-resolver-dns-native-macos/4.1.65.Final/netty-resolver-dns-native-macos-4.1.65.Final-osx-x86_64.jar'.
       > Could not GET 'https://repo.maven.apache.org/maven2/io/netty/netty-resolver-dns-native-macos/4.1.65.Final/netty-resolver-dns-native-macos-4.1.65.Final-osx-x86_64.jar'.
       > Read timed out

For React/TypeScript projects, the symptoms are similar but the error itself manifests in a different way:

[INFO] Using npm v8.1.0 from package.json
/cnb/buildpacks/heroku_nodejs-npm/0.4.4/lib/build.sh: line 179: /layers/heroku_nodejs-engine/toolbox/bin/yj: Permission denied
ERROR: failed to build: exit status 126
ERROR: failed to build: executing lifecycle: failed with status code: 145

The problem seems to occur mostly when the GitLab runners themselves are deployed in Kubernetes. microk8s uses Project Calico to implement virtual networks.

What gives? Why are the error messages so unhelpful? Is there a way to turn up verbose build logs or debug the build steps?


Solution

  • This seems to be a networking problem caused by incompatible MTU settings between the Calico network layer and Docker's network configuration (and an inability to autoconfigure the MTU correctly?). When the MTU values don't match, network packets get fragmented and the Docker runners fail to complete TLS handshakes. As far as I understand, this only affects DIND (docker-in-docker) runners.

    Even finding this out requires jumping through a few hoops. You have to:

    1. Start a CI pipeline and wait for the job to "hang"
    2. kubectl exec into the current/active GitLab runner pod
    3. Find out the correct value for the DOCKER_HOST environment variable (e.g. by grepping through /proc/$pid/environ). Very likely, this will be tcp://localhost:2375.
    4. Export the value to be used by the docker client: export DOCKER_HOST=tcp://localhost:2375
    5. docker ps and then docker exec into the actual CI job container
    6. Use ping and other tools to find proper MTU values (but MTU for what? Docker, Calico, OS, router, …?). Use curl/openssl to verify that (certain) https sites cause problems from inside the DIND container.
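
Steps 2 to 6 can be sketched as a few shell helpers. This is a rough sketch rather than a verified procedure: the helper names are made up, everything in angle brackets is a placeholder, and `probe_mtu` assumes the container ships an iputils-style `ping` that supports `-M do` (BusyBox `ping` does not).

```shell
#!/bin/sh
# Sketch only: helper names are invented for illustration.

# Step 3: extract DOCKER_HOST from a NUL-separated environ file,
# e.g. /proc/<pid>/environ of a process inside the runner pod.
docker_host_from_environ() {
    tr '\0' '\n' < "$1" | sed -n 's/^DOCKER_HOST=//p'
}

# Step 6: an ICMP echo carries 20 bytes of IPv4 header and 8 bytes of
# ICMP header, so the largest payload that fits in one packet is MTU - 28.
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Step 6: send one ping with the don't-fragment bit set, sized for the
# given MTU. Assumes iputils ping ("-M do" forbids fragmentation).
probe_mtu() {
    host=$1
    mtu=$2
    ping -M do -c 1 -s "$(icmp_payload_for_mtu "$mtu")" "$host"
}

# Typical session (paste the helpers into each shell you open):
#   kubectl -n <runner-namespace> exec -it <runner-pod> -- sh
#   docker_host_from_environ /proc/<pid>/environ
#   export DOCKER_HOST=tcp://localhost:2375
#   docker ps
#   docker exec -it <job-container> sh
#   probe_mtu repo.maven.apache.org 1440    # then retry with lower values
#   curl -sv -o /dev/null https://repo.maven.apache.org/maven2/
```

If a probe at a given size fails while a smaller one succeeds, the usable MTU along that path lies somewhere in between.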

    Execute

    microk8s kubectl get -n kube-system cm calico-config -o yaml
    

    and look for the veth_mtu value, which will very likely be set to 1440. DIND uses the same MTU and thus fails to send or receive certain network packets (each virtual network needs to add its own header to the network packet, which adds a few bytes at every layer).
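
To make the header overhead concrete, here is a small sketch that reads the configured value and does the subtraction. The 20-byte and 50-byte figures are the overheads Calico's documentation gives for IP-in-IP and VXLAN, as far as I know; the 1500-byte outer MTU is an assumption about a typical Ethernet link, and both function names are made up.

```shell
#!/bin/sh
# Print the veth_mtu currently stored in the Calico config map.
calico_veth_mtu() {
    microk8s kubectl -n kube-system get cm calico-config \
        -o jsonpath='{.data.veth_mtu}'
}

# MTU left for an inner network once an encapsulation layer has added
# its header: inner = outer - overhead.
inner_mtu() {
    echo $(( $1 - $2 ))
}

inner_mtu 1500 20   # IP-in-IP on an assumed 1500-byte link, prints 1480
inner_mtu 1500 50   # VXLAN on an assumed 1500-byte link, prints 1450
```

Every additional nested network, such as the Docker bridge inside a DIND container, needs an MTU no larger than the layer it runs on.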

    The naïve fix would be to change the Calico setting to a higher or lower value, but this did not really work for me, even after redeploying Calico. Furthermore, the value seems to be reset to its original value from time to time, probably by automatic updates to microk8s (which ships as a Snap).

    So what is a solution that actually works and is permanent? It is possible to override DIND settings for Auto DevOps by writing a custom .gitlab-ci.yml file and simply including the Auto DevOps template:

    build:
      services:
        - name: docker:20.10.6-dind # make sure to update version
          command: ['--tls=false', '--host=tcp://0.0.0.0:2375', '--mtu=1240']
    
    include:
        - template: Auto-DevOps.gitlab-ci.yml
    

    The build.services definition is copied from the Jobs/Build.gitlab-ci.yml template and extended with an additional --mtu option.

    I've had good experience so far by setting the DIND MTU to 1240, which is 200 bytes lower than Calico's MTU. As an added bonus, it doesn't affect any other pods' network settings. And for CI builds I can live with non-optimal network settings.
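
One way to confirm that the option took effect is to read the MTU of a container's interface from within a CI job. A hedged sketch: `alpine` is just an arbitrary small image, `eth0` is assumed to be the container's interface on Docker's default bridge network, and both helper names are made up.

```shell
#!/bin/sh
# Print the MTU a fresh container sees on the default bridge network.
container_mtu() {
    docker run --rm alpine cat /sys/class/net/eth0/mtu
}

# Succeed only if the container MTU equals the expected value.
check_dind_mtu() {
    expected=$1
    actual=$(container_mtu)
    if [ "$actual" = "$expected" ]; then
        echo "ok: container MTU is $actual"
    else
        echo "mismatch: expected $expected, got $actual" >&2
        return 1
    fi
}

# Usage inside a job that talks to the DIND service:
#   check_dind_mtu 1240
```

The daemon's --mtu option applies to the default bridge network, so containers attached to a user-defined network may report a different value.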

    Update 2024

    The symptoms started showing up again: slow pipelines with a high failure rate (>80%), sometimes related to network timeouts and sometimes with seemingly random errors.

    If you run microk8s 1.24+ with Calico 3.21+, make sure to set veth_mtu in the Calico config map to "0". If you have upgraded from an earlier version, chances are high that the config map still sets it to a non-zero value such as "1440".

    Check your current values with:

    kubectl -n kube-system get -oyaml daemonset.apps/calico-node | grep 'image:'
    kubectl -n kube-system get -oyaml configmap/calico-config | grep 'veth_mtu:'
    

    Setting it to 0 seems to properly auto-detect the correct MTU value. The workaround of manually specifying a lower MTU in the .gitlab-ci.yml file is no longer required, and the manual services override can be removed.
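
For completeness, a minimal sketch of how the value could be changed from the command line. The helper names are made up, and restarting the calico-node daemonset is my assumption about what is needed for the new value to be picked up; pods that already exist may keep their old MTU until they are recreated.

```shell
#!/bin/sh
# Build the JSON merge patch that sets veth_mtu.
# Config map values are strings, hence the quotes around the number.
veth_mtu_patch() {
    printf '{"data":{"veth_mtu":"%s"}}' "$1"
}

# Apply the patch, then restart calico-node so it re-reads the config map.
set_veth_mtu() {
    kubectl -n kube-system patch configmap calico-config \
        --type merge -p "$(veth_mtu_patch "$1")" &&
    kubectl -n kube-system rollout restart daemonset/calico-node
}

# Usage:
#   set_veth_mtu 0
```

Afterwards, rerun the two check commands above to confirm the config map really holds the new value.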

