sql-server, .net-core, amazon-ec2, azure-web-app-service, azure-hybrid-connections

About 1 in every 30 connections fails with Win32Exception (Unknown location): Azure web app to SQL Server on AWS


We have a couple of .NET Core 3.0 Web Apps (UK South) that connect to an MS SQL Server 2016 database running on an Amazon EC2 instance (Windows Server 2016 Datacenter). We connect via an Azure Relay Hybrid Connection; the Hybrid Connection Manager is installed on the SQL Server machine.

It has been working fine for over a year with no errors, but recently we've started getting the following error, about 1 in every 30 connections:

An unhandled exception occurred while processing the request. Win32Exception: An existing connection was forcibly closed by the remote host. Unknown location

SqlException: A connection was successfully established with the server, but then an error occurred during the pre-login handshake. (provider: TCP Provider, error: 0 - An existing connection was forcibly closed by the remote host.)

If you try again it usually works.

After reading a lot of posts on this, I added transient error handling (connection resilience) to the DB connection using EnableRetryOnFailure().

I also tried adding Trusted_Connection=False to the connection string.

After this you could see the connection retrying multiple times until it worked, sometimes taking 20 seconds or more. Still, in maybe 1 in 100 connections it eventually fails with the same error.
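For readers unfamiliar with what a retry policy does, here is a minimal, language-neutral sketch in Python of retrying a transient failure with capped exponential backoff. It is not EF Core's implementation of EnableRetryOnFailure(); the function name `retry_with_backoff`, its parameters, and the default values are all hypothetical and chosen only for illustration. It shows why a connection that eventually succeeds can take many seconds: each failed attempt adds a progressively longer wait.

```python
import random
import time


def retry_with_backoff(operation, max_retries=5, base_delay=0.5, max_delay=30.0,
                       is_transient=lambda exc: isinstance(exc, ConnectionError),
                       sleep=time.sleep):
    """Call operation(); retry transient failures with capped exponential backoff.

    All names and defaults here are illustrative assumptions, not values
    taken from EF Core or from the original post.
    """
    attempt = 0
    while True:
        try:
            return operation()
        except Exception as exc:
            # Give up when the error is not transient or retries are exhausted.
            if attempt >= max_retries or not is_transient(exc):
                raise
            # The delay doubles on each attempt, capped at max_delay.
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Jitter spreads out retries from many clients hitting the same server.
            sleep(delay * random.uniform(0.5, 1.0))
            attempt += 1
```

A retry policy like this hides intermittent failures rather than removing them, which matches what was observed: fewer visible errors, but slower connections and an occasional failure once the retries run out.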

We also looked at the TLS_DHE bug https://learn.microsoft.com/en-us/troubleshoot/windows-server/identity/apps-forcibly-closed-tls-connection-errors but the TLS_DHE ciphers are not installed on the server at all.

There's nothing in the event logs on the Windows server, or in the database logs at the time of the error.

Recent changes in the infrastructure: Panda antivirus, moved web apps to a different Azure region.

I've been reading posts on this for days now, mostly really old and slightly different. I'm looking for any ideas of things to try to pinpoint the error. Thanks.

edit: I found some event logs in Microsoft/ServiceBus/Client

HybridConnectionManager Trace: Microsoft.Azure.Relay.RelayException: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host.
 ---> System.Net.WebSockets.WebSocketException: An internal WebSocket error occurred. Please see the innerException, if present, for more details.
 ---> System.IO.IOException: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host.
 ---> System.Net.Sockets.SocketException: An existing connection was forcibly closed by the remote host
   at System.Net.Sockets.Socket.EndReceive(IAsyncResult asyncResult)
   at System.Net.Sockets.NetworkStream.EndRead(IAsyncResult asyncResult)
   --- End of inner exception stack trace ---


Solution

  • Well, this took three months to resolve and it involved our network support team, AWS support, and Azure support.

    I've come back three times to edit this answer. The problem returned on a different server, so we tried the fixes that had worked on the first one, and they didn't work!

    In Azure Relay/Hybrid Connections, under the connection in question, we saw there were TWO listeners when there should only be one. Each Hybrid Connection Manager you install and connect shows up there as a listener.

    So where was the second listener? Nowhere. It seemed to be a hanging orphan link from a previously deleted connection.

    The only way to delete the phantom listener was to:

    • uninstall HCM on the database server
    • remove the connection from all Azure apps using it
    • delete the hybrid connection completely in Azure
    • recreate the connection in Azure afresh
    • reconnect the apps
    • reinstall HCM on the database server
    • connect HCM to the new hybrid connection

    After this, Azure showed only one listener under the connection, and things worked immediately.
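The listener count can also be checked from a script rather than the portal. The sketch below is only an illustration and rests on several assumptions: that the Azure CLI is installed and logged in, that `az relay hyco show` accepts `--resource-group`, `--namespace-name` and `--name`, and that its JSON output exposes a `listenerCount` field (either at the top level or under `properties`). The function names and the placeholder arguments are hypothetical.

```python
import json
import shutil
import subprocess


def parse_listener_count(az_json: str) -> int:
    """Extract the listener count from the CLI's JSON output.

    Assumption: the field is called listenerCount and sits either at the
    top level or under a nested "properties" object.
    """
    data = json.loads(az_json)
    if "listenerCount" in data:
        return int(data["listenerCount"])
    return int(data["properties"]["listenerCount"])


def get_listener_count(resource_group: str, namespace: str, name: str) -> int:
    """Ask the Azure CLI for a hybrid connection and return its listener count."""
    az = shutil.which("az")  # resolves az.cmd on Windows as well
    if az is None:
        raise RuntimeError("Azure CLI (az) not found on PATH")
    result = subprocess.run(
        [az, "relay", "hyco", "show",
         "--resource-group", resource_group,
         "--namespace-name", namespace,
         "--name", name,
         "--output", "json"],
        check=True, capture_output=True, text=True,
    )
    return parse_listener_count(result.stdout)
```

With a single Hybrid Connection Manager installed, anything other than 1 from `get_listener_count(...)` would be worth investigating.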

    When you have two listeners the data is load balanced between them, so in my case half the time the data was being routed to a non-existent listener and failing. This is why no logs appeared on the database server - it wasn't getting there at all!
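The arithmetic behind this can be sketched with a small model. It assumes each connection attempt is routed independently and uniformly at random to one of the registered listeners, which is a simplification and not a documented description of how Azure Relay balances load. The function names are hypothetical.

```python
import random


def per_attempt_failure(live_listeners: int, phantom_listeners: int) -> float:
    """Chance that a single attempt is routed to a listener that does not exist."""
    total = live_listeners + phantom_listeners
    if total == 0:
        raise ValueError("no listeners registered")
    return phantom_listeners / total


def failure_after_retries(p_fail: float, retries: int) -> float:
    """Chance that the first attempt and every retry all fail (independent attempts)."""
    return p_fail ** (retries + 1)


def simulate(live: int, phantom: int, retries: int,
             trials: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity, as a check on the formula."""
    rng = random.Random(seed)
    total = live + phantom
    failed = 0
    for _ in range(trials):
        # Listener indexes 0..live-1 are real; the rest are phantoms.
        if all(rng.randrange(total) >= live for _ in range(retries + 1)):
            failed += 1
    return failed / trials
```

Under this model, one live and one phantom listener give `per_attempt_failure(1, 1) == 0.5`, and retries only shrink the failure rate geometrically (for example `failure_after_retries(0.5, 5)` is `0.015625`) without ever reaching zero. Removing the phantom gives `per_attempt_failure(1, 0) == 0.0`, which is why deleting and recreating the connection fixed the problem where retries could not.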