I am looking for some advice on debugging some extremely painful Docker connectivity issues.
In particular, for an Azure DevOps Services Git repository, I am running a self-hosted (locally) dockerized Linux CI (setup according to https://learn.microsoft.com/en-us/azure/devops/pipelines/agents/docker?view=azure-devops#linux), which has been working fine for a few months now.
All this runs on a company network, and since last week the network connection of my docker container became highly unstable:
Specifically it intermittently looses network connection, which is also visible via the logs of the Azure DevOps agent, which then keeps trying to reconnect.
This especially happens while downloading Git LFS objects. Enabling extra traces via GIT_TRACE=1 highlights a lot of connection failures and retries:
trace git-lfs: xfer: failed to resume download for "SHA" from byte N: expected status code 206, received 200. Re-downloading from start
During such a LFS pull / fetch, sometimes the container even stops responding as a docker container list
command only responds:
Error response from daemon: i/o timeout
As a result the daemon cannot recover on its own, and needs a manual restart (to get back up the CI).
Also I see remarkable differences in network performance:
Hence I am suspecting some network configuration issues (e.g. sth with the Hyper-V vEthernet Adapter? Firewall? Proxy? or whichever other watchdog going astray?), but after three days of debugging, I am not quite sure how to further investigate this issue, as I am running out of ideas and expertise. Any thoughts / advice / hints?
I should add that LFS configuration options (such as lfs.concurrenttransfers and lfs.basictransfersonly) did not really help, similarly for git config http.version (or just removing some larger files)
it does not actually seem to be about the self-hosted agent but a more general docker network cfg issue within my corporate network.
Running the following works consistently fast on my VPN machine (running from home):
docker run -it
ubuntu bash -c "apt-get update; apt-get install -y wget; start=$SECONDS;
wget http://cdimage.ubuntu.com/lubuntu/releases/18.04/release/lubuntu-18.04-alternate-amd64.iso;
echo Duration: $(( SECONDS - start )) seconds"
Comparision with powershell download (on the host):
$start=Get-Date
$(New-Object
net.webclient).Downloadfile("http://cdimage.ubuntu.com/lubuntu/releases/18.04/release/lubuntu-18.04-alternate-amd64.iso",
"e:/temp/lubuntu-18.04-alternate-amd64.iso")
'Duration: {0:mm}
min {0:ss} sec' -f ($(Get-Date)-$start)
Corporate network
Dev laptop (VPN, from home):
Looking at the issues discussed in https://github.com/docker/for-win/issues/698 (and proposed workaround which didn't work for me), it seems to be a non-trivial problem with Windows / hyper-v ..
The whole issue "solved itself" when my company decided to finally upgrade from Win10 1803 to 1909 (which comes with WSL, replacing Hyper-V) .. 😂 Now everything runs supersmoothly (I kept running these tests for almost 20 times)