Tags: tfs, azure-pipelines, windows-server-2012-r2, vstest

TFS 2018 Release Process - Mysterious Server Restart On "Deploy TestAgent" Build Step


I have an interesting issue with the latest version of on-premises TFS (2018, version 16.122.27102.1). I have a release process that includes a "Deploy TestAgent on localhost" step. It looks like this:

[Screenshot: the "Deploy TestAgent on localhost" step]

It normally works great, and it worked great when we were on TFS 2012. Since we upgraded to 2018, however, this process occasionally fails with a strange error when it runs on one particular build agent (Agent-19 only):

Operating system is shutting down for computer 'XXX_TESTING'

The agent: Agent-19 lost communication with the server. Verify the machine is running and has a healthy network connection. For more information, see: https://go.microsoft.com/fwlink/?linkid=846610

Strangely, the restart seems to have been initiated by the same service account that the TFS build agent runs under:

[Screenshot: the restart event, showing the service account that initiated it]

There isn't a whole lot of information there, and the TFS build worker log doesn't have much either:

[2018-03-01 00:46:35Z INFO ProcessInvoker] Starting process:
[2018-03-01 00:46:35Z INFO ProcessInvoker] File name: 'C:\TFS Agent\externals\vstshost\LegacyVSTSPowerShellHost.exe'
[2018-03-01 00:46:35Z INFO ProcessInvoker] Arguments: ''
[2018-03-01 00:46:35Z INFO ProcessInvoker] Working directory: 'C:\TFS Agent\_work\_tasks\DeployVisualStudioTestAgent_52a38a6a-1517-41d7-96cc-73ee0c60d2b6\1.0.42'
[2018-03-01 00:46:35Z INFO ProcessInvoker] Require exit code zero: 'False'
[2018-03-01 00:46:35Z INFO ProcessInvoker] Encoding web name: ; code page: ''
[2018-03-01 00:46:35Z INFO ProcessInvoker] Force kill process on cancellation: 'False'
[2018-03-01 00:46:35Z INFO ProcessInvoker] Process started with process id 14620, waiting for process exit.
[2018-03-01 00:46:35Z INFO JobServerQueue] Try to upload 1 log files or attachments, success rate: 1/1.
[2018-03-01 00:48:11Z INFO Worker] Cancellation/Shutdown message received.
[2018-03-01 00:48:11Z INFO HostContext] Agent will be shutdown for OperatingSystemShutdown
[2018-03-01 00:48:11Z INFO StepsRunner] Cancel current running step.
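To correlate the shutdown with other logs, it helps to pull out exactly when the worker saw the OS shutdown and which step process was running at that moment. Below is a minimal Python sketch, assuming the worker log line format shown above (`[timestamp Z LEVEL Component] message`). `find_shutdown_context` is a hypothetical helper, not part of the agent.

```python
import re
from datetime import datetime, timezone

# Assumed line format, taken from the excerpt above:
# [2018-03-01 00:46:35Z INFO ProcessInvoker] Process started with process id 14620, ...
LINE_RE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})Z (?P<level>\w+) (?P<component>\w+)\] (?P<msg>.*)$"
)
FILE_RE = re.compile(r"File name: '(.*)'")
PID_RE = re.compile(r"Process started with process id (\d+)")


def find_shutdown_context(lines):
    """Return (shutdown_time_utc, last_started_file, last_started_pid),
    or None if the log never mentions an OperatingSystemShutdown."""
    last_file = None
    last_pid = None
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip blank or non-matching lines
        msg = m.group("msg")
        file_match = FILE_RE.search(msg)
        pid_match = PID_RE.search(msg)
        if file_match:
            # Remember the executable the step launched most recently.
            last_file = file_match.group(1)
        elif pid_match:
            last_pid = int(pid_match.group(1))
        elif "OperatingSystemShutdown" in msg:
            # The worker logs this when Windows tells it the OS is going down.
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S")
            return ts.replace(tzinfo=timezone.utc), last_file, last_pid
    return None
```

For example, pass it the lines of the worker log (typically a `Worker_*.log` file under the agent's `_diag` folder). For the excerpt above it reports 2018-03-01 00:48:11 UTC, LegacyVSTSPowerShellHost.exe, and process id 14620, which gives a precise time to look for in the Windows event log.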

So the system shuts down, the agent stops, and the tests don't run, but I have no idea why. I re-imaged the entire server from a copy of one of my other build servers and reinstalled the build agent, but the issue persists. It only occurs on that build server, only on that step, and only "sometimes" (I haven't identified a pattern, but it's generally during the nightly run at 6:30 PM CST).

How do I diagnose this? Is there a place that would tell me why a system restarted? The information above doesn't tell me much, and after searching around I haven't found anyone else with an issue of this nature.
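Windows does record who requested a shutdown: when a process or user initiates a restart, the System log gets Event ID 1074 from source User32, naming the initiating process, the user account, the shutdown type, and the reason code. Those events can be exported with `wevtutil qe System /q:"*[System[(EventID=1074)]]" /f:xml`. Below is a minimal Python sketch that parses such an export and narrows it to a time window around the agent's shutdown message. The param1 to param7 field mapping is an assumption based on how 1074 events commonly appear, so check it against an export from the affected server. `parse_shutdown_events` and `events_near` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}

# ASSUMPTION: usual layout of the User32 1074 event data; verify on your own export.
FIELDS = {
    "param1": "process",
    "param2": "computer",
    "param3": "reason",
    "param4": "reason_code",
    "param5": "shutdown_type",
    "param6": "comment",
    "param7": "user",
}


def parse_shutdown_events(xml_text):
    """Parse the <Event> elements produced by `wevtutil qe ... /f:xml`."""
    # wevtutil writes one <Event> per record with no enclosing root element, so add one.
    root = ET.fromstring("<Events>" + xml_text + "</Events>")
    results = []
    for event in root.findall("e:Event", NS):
        if event.findtext("e:System/e:EventID", namespaces=NS) != "1074":
            continue  # ignore any other event IDs in the export
        created = event.find("e:System/e:TimeCreated", NS).get("SystemTime")
        # SystemTime carries 7 fractional digits; keep whole seconds only.
        when = datetime.strptime(created[:19], "%Y-%m-%dT%H:%M:%S")
        record = {"time_utc": when.replace(tzinfo=timezone.utc)}
        for data in event.findall("e:EventData/e:Data", NS):
            key = FIELDS.get(data.get("Name"))
            if key:
                record[key] = data.text or ""
        results.append(record)
    return results


def events_near(events, when, window_seconds=300):
    """Keep only events within window_seconds of `when` (a UTC-aware datetime)."""
    return [e for e in events if abs((e["time_utc"] - when).total_seconds()) <= window_seconds]
```

Calling `events_near(parse_shutdown_events(text), shutdown_time)` with the time the worker logged `OperatingSystemShutdown` shows which process and account asked for the restart, which is the "why" the event log screenshot alone doesn't spell out.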


Solution

  • First of all, the Deploy Test Agent step is deprecated; it has been replaced by the new agent infrastructure and the VS Test 2.0 runners. See:

    The install test agent step was meant to install the test agent onto a different server/VM, not onto the agent machine itself.


    The design assumes the build/release agent stays alive to monitor the test agent as it comes back up to run the tests. The reasons why the agent may trigger a reboot can be found here: