Search code examples
sql-serverdelphiado

Auto-Recover when DBNETLIB ConnectionWrite General network error causes ADO connections to go offline in Delphi applications?


Googling this ADO error message indicates that it is commonly encountered in ASP.NET development, but I have not found much mention of when it occurs in Delphi applications. We have some customer sites which are experiencing transient network problems, and this is the symptomatic error message. We can duplicate it in office testing easily; Just shut down an MS SQL Server service while your delphi TADOConnection object is connected to a database on that server instance and you get this exception:

   [DBNETLIB][ConnectionWrite (send()).]General network error. Check your network documentation.

Yes, catch this exception, and you know (or do you?) that this error has occurred. Except that this is an 800 KLOC+ application with over 10,000 try-except blocks around database actions, any one of which might fail with this error.

TADOConnection has some error events, none of which fire in this case. However, the ADO Connection itself is faulted once this occurs, even if you restart the SQL database, TADOConnection.Connected remains true, but it's lying to you. It's really in a faulted state.

So then, my question is:

Can you detect this faulted state, and recover from it, in any way that is less work than going into 10,000 individual try-except blocks and setting some global "reconnect ADO global variable"?

I am hoping there is a way to go into TADOConnection.ConnectionObject (the underlying raw OLEDB COM ADO object) and detect this fault condition exists when we are starting a new query, so that we can reset the ADOConnection and continue the next time we run a query. Since our code is organized in a way that would allow us to detect this "after the failure" much more easily than it would allow us to do it the way I would do this in a 10 line demo application.

This other SO question asks why it happens, that is not what I'm asking, please don't give me "prevention" answers, I know about them already, I'm looking for a recovery and detection-of-stalled-ADO-connection technique other than catching the exceptions. In fact, this is a good example of exceptions gone wrong; ADO is a schrodingers-cat object in this failure mode.

I am aware of the MS Knowledgebase articles, and the various solutions floating around the internet. I'm asking about RECOVERING without losing customer data, once the error condition (which is often transient in our situations) has cleared. That means we freeze our app, show the exception to the customer, and when the customer clicks Retry or Continue, we attempt to repair and continue. note that our existing code does a million try-except-log-and-continue code, that is going to get in our way, so I'm expecting someone to answer that an Application handler for unhandled exceptions is the best way, but sadly we can't use it. I really hope however that it is possible to detect a frozen/faulted/dead ADO connection object.

Here's what I have:

try
  if fQueryEnable and ADOConnection1.Connected then begin
    qQueryTest1.Active := false;
    qQueryTest1.Active := true;
    Inc(FQryCounter);
    Label2.Caption := IntToStr(qQueryTest1.RecordCount)+' records';

  end;
except
      on E:Exception do begin
         fQueryEnable := false;
         Memo1.Lines.Add(E.ClassName+' '+E.Message);
         if E is EOleException and Pos('DBNETLIB',E.Message)>0 then begin
            ADOConnectionFaulted := boolean; { Global variable. }
         end;
         raise;
      end;
end;

The problem with the above solution is that I need to copy and paste it about 10,000 places in my application.


Solution

  • Well nobody has answered this question, and I think that some follow-up would be helpful.

    Here is what I have learned:

    • There are no reliable situations where in a test environment you can reproduce this General Network Error. That is to say, we're dealing with Irreproducible Results, which is where many developers leap into evil hackery in an attempt to "monkeypatch" their broken systems.

    • Fixing the underlying fault has always and everywhere been better than fixing it in code, when the SQL library gives a "General Network Error". No repair has ever been shown to be possible, because usually it means "the network is so unreliable that TCP itself has given up on delivering my data", this happens when:

      • You have a bad network cable.

      • You have duplicate IP addresses on a network.

      • You have dueling DHCP servers each handling out different default gateways.

      • You have local ethernet segments that have poor connectivity between them.

      • You have an ethernet switch or hub which is failing.

      • You are being intermittently blocked by a malfunctioning firewall.

      • Your customer may have changed something on their network, and may now be unable to use your software. (This last one actually happens more than you might think)

      • Someone may have configured an SQL Alias using cliconfg or other client side configuration elements that are specific to a single workstation's registry settings, and this local configuration may result in bad behaviour that is difficult to diagnose and may be limited to one or several workstations on a large network.

    None of the above can be detected and reported either at the TCP or SQL level. When SQL finally gives up, and it gives this "General Network Error", no amount of cajoling from my software is going to get it to un-give-up, and even if it did, I would be doing a "try/except/ignore" antipattern. This error is so severe that we should raise it all the way up to the user, log it to disk in an error log, give up (quit the program), and tell the user that the network connection is down.