Search code examples
dockerhyper-vdocker-for-windowsdocker-networkgit-lfs

Docker connectivity issues (to Azure DevOps Services from self hosted Linux Docker agent)


I am looking for some advice on debugging some extremely painful Docker connectivity issues.

In particular, for an Azure DevOps Services Git repository, I am running a self-hosted (locally) dockerized Linux CI (setup according to https://learn.microsoft.com/en-us/azure/devops/pipelines/agents/docker?view=azure-devops#linux), which has been working fine for a few months now.

All this runs on a company network, and since last week the network connection of my docker container became highly unstable:

  • Specifically it intermittently looses network connection, which is also visible via the logs of the Azure DevOps agent, which then keeps trying to reconnect.

  • This especially happens while downloading Git LFS objects. Enabling extra traces via GIT_TRACE=1 highlights a lot of connection failures and retries:

    trace git-lfs: xfer: failed to resume download for "SHA" from byte N: expected status code 206, received 200. Re-downloading from start

  • During such a LFS pull / fetch, sometimes the container even stops responding as a docker container list command only responds:

    Error response from daemon: i/o timeout

    As a result the daemon cannot recover on its own, and needs a manual restart (to get back up the CI).

Also I see remarkable differences in network performance:

  • Manually cloning the same Git repository (including LFS objects, all from scratch) in container instances (created from the same image) on different machines, takes less than 2mins on my dev laptop machine (connected from home via VPN), while the same operation easily takes up to 20minutes (!) on containers running two different Win10 machines (company network, physically located in offices, hence no VPN.
  • Clearly this is not about the host network connection itself, since cloning on the same Win10 hosts (company network/offices) outside of the containers takes only 14seconds!

Hence I am suspecting some network configuration issues (e.g. sth with the Hyper-V vEthernet Adapter? Firewall? Proxy? or whichever other watchdog going astray?), but after three days of debugging, I am not quite sure how to further investigate this issue, as I am running out of ideas and expertise. Any thoughts / advice / hints?

I should add that LFS configuration options (such as lfs.concurrenttransfers and lfs.basictransfersonly) did not really help, similarly for git config http.version (or just removing some larger files)


UPDATE

it does not actually seem to be about the self-hosted agent but a more general docker network cfg issue within my corporate network.

Running the following works consistently fast on my VPN machine (running from home):

docker run -it
ubuntu bash -c "apt-get update; apt-get install -y wget; start=$SECONDS;
wget http://cdimage.ubuntu.com/lubuntu/releases/18.04/release/lubuntu-18.04-alternate-amd64.iso;
echo Duration: $(( SECONDS - start )) seconds"

Comparision with powershell download (on the host):

$start=Get-Date
$(New-Object
net.webclient).Downloadfile("http://cdimage.ubuntu.com/lubuntu/releases/18.04/release/lubuntu-18.04-alternate-amd64.iso",
"e:/temp/lubuntu-18.04-alternate-amd64.iso")
'Duration: {0:mm}
min {0:ss} sec' -f ($(Get-Date)-$start)

Corporate network

  • Docker: 1560 seconds (=26 min!)
  • Windows host sys: Duration: 00 min 15 sec

Dev laptop (VPN, from home):

  • Docker: 144 seconds (=2min 24sec)
  • Windows host sys: Duration: 02 min 16 sec

Looking at the issues discussed in https://github.com/docker/for-win/issues/698 (and proposed workaround which didn't work for me), it seems to be a non-trivial problem with Windows / hyper-v ..


Solution

  • The whole issue "solved itself" when my company decided to finally upgrade from Win10 1803 to 1909 (which comes with WSL, replacing Hyper-V) .. 😂 Now everything runs supersmoothly (I kept running these tests for almost 20 times)