c# wcf fault reliablesession

Unexpected fault on ReliableSession in NetTcpBinding (WCF)


I have a client-server application. My scenario:

  • .Net Framework 4.6.1
  • Quad Core i7 machine with hyperthreading enabled
  • Server CPU load from 20 - 70 %
  • Network load < 5% (GBit NIC)
  • 100 users
  • 30 services (some administrative ones, some generic ones per datatype) running and each user is connected to all services
  • NetTcpBinding (compression enabled)
  • ReliableSession enabled
  • every second the server triggers an update notification, and each client then loads approx. 100 kB from the server
  • additionally a heartbeat is running (15-second interval for testing) which simply returns the server time in UTC

Sometimes the WCF connections switch to the Faulted state. Usually, when this happens, the server has no upstream (outgoing) network traffic at all. I took a memory dump and saw that many WCF threads were waiting on a WaitQueue. The call stack is:

Server stack trace: 
   at System.ServiceModel.Channels.TransmissionStrategy.WaitQueueAdder.Wait(TimeSpan timeout)
   at System.ServiceModel.Channels.TransmissionStrategy.InternalAdd(Message message, Boolean isLast, TimeSpan timeout, Object state, MessageAttemptInfo& attemptInfo)
   at System.ServiceModel.Channels.ReliableOutputConnection.InternalAddMessage(Message message, TimeSpan timeout, Object state, Boolean isLast)
   at System.ServiceModel.Channels.ReliableDuplexSessionChannel.OnSend(Message message, TimeSpan timeout)
   at System.ServiceModel.Channels.DuplexChannel.Send(Message message, TimeSpan timeout)
   at System.ServiceModel.Dispatcher.DuplexChannelBinder.Send(Message message, TimeSpan timeout)
   at System.ServiceModel.Channels.ServiceChannel.Call(String action, Boolean oneway, ProxyOperationRuntime operation, Object[] ins, Object[] outs, TimeSpan timeout)
   at System.ServiceModel.Channels.ServiceChannelProxy.InvokeService(IMethodCallMessage methodCall, ProxyOperationRuntime operation)
   at System.ServiceModel.Channels.ServiceChannelProxy.Invoke(IMessage message)

I tweaked the settings and the situation seems to have eased: fewer clients fault now. My settings:

  • ReliableSession.InactivityTimeout: 01:30:00
  • ReliableSession.Enabled: True
  • ReliableSession.Ordered: False
  • ReliableSession.FlowControlEnabled: False
  • ReliableSession.MaxTransferWindowSize: 4096
  • ReliableSession.MaxPendingChannels: 16384
  • MaxReceivedMessageSize: 1073741824
  • ReaderQuotas.MaxStringContentLength: 8388608
  • ReaderQuotas.MaxArrayLength: 1073741824
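For readers who want to reproduce this setup in code, here is a minimal sketch of how the settings above could be applied. It is not taken from the original application. `NetTcpBinding.ReliableSession` itself only exposes `Enabled`, `InactivityTimeout` and `Ordered`, so the sketch converts the binding to a `CustomBinding` and sets the remaining values on the `ReliableSessionBindingElement`. The class name `BindingFactory`, the security mode (the `NetTcpBinding` default) and the GZip compression format are assumptions; the question only says "compression enabled".

```csharp
using System;
using System.ServiceModel;
using System.ServiceModel.Channels;

public static class BindingFactory // hypothetical helper, not from the original post
{
    // Builds a binding that mirrors the settings listed in the question.
    public static Binding CreateBinding()
    {
        // Assumption: Transport security, which is the NetTcpBinding default.
        var tcp = new NetTcpBinding(SecurityMode.Transport, true);
        tcp.MaxReceivedMessageSize = 1073741824;
        tcp.ReaderQuotas.MaxStringContentLength = 8388608;
        tcp.ReaderQuotas.MaxArrayLength = 1073741824;

        // Convert to a CustomBinding to reach the settings that
        // NetTcpBinding does not expose directly.
        var custom = new CustomBinding(tcp);

        var reliable = custom.Elements.Find<ReliableSessionBindingElement>();
        reliable.InactivityTimeout = TimeSpan.FromMinutes(90); // 01:30:00
        reliable.Ordered = false;
        reliable.FlowControlEnabled = false;
        reliable.MaxTransferWindowSize = 4096;  // upper limit allowed by WCF
        reliable.MaxPendingChannels = 16384;    // upper limit allowed by WCF

        // Assumption: GZip; the question does not name the compression format.
        var encoding = custom.Elements.Find<BinaryMessageEncodingBindingElement>();
        encoding.CompressionFormat = CompressionFormat.GZip;

        return custom;
    }
}
```

The same binding would have to be used on both the service endpoint and the client channel factory, since both sides must agree on encoding and reliable session settings.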

I am stuck. Why do all calls try to wait for some WaitQueue in the TransmissionStrategy? I do not care about messages being sent out of order (I do take care of that myself). I was already thinking about disabling reliable messaging but the application is used in a company network worldwide. I need to know that my messages were delivered.

Any ideas how to teach WCF to just send the messages and not care about anything else?

EDIT

The values for service throttling are set to Int32.MaxValue.

I also tried setting MaxConnections and ListenBacklog (on NetTcpBinding) to their maximum values. It did not change anything, as far as I can tell.

EDIT 2

The WCF traces tell me (German message, therefore a rough translation) that there is no space available in the reliable messaging transfer window. After that, all I get are timeouts because no more messages are sent.
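To make the symptom easier to picture, here is a toy model of a bounded transfer window. It is only an illustration and not how WCF implements `TransmissionStrategy` internally; the names `TransferWindowModel`, `TrySend` and `Acknowledge` are made up for this sketch. Each send takes one slot, each acknowledgement frees one, and a sender that finds no free slot waits until its timeout expires.

```csharp
using System;
using System.Threading;

// Toy model of a bounded transfer window. NOT WCF's actual implementation.
public sealed class TransferWindowModel
{
    private readonly SemaphoreSlim _slots;

    public TransferWindowModel(int windowSize)
    {
        _slots = new SemaphoreSlim(windowSize, windowSize);
    }

    // Takes one window slot; returns false if none frees up within the timeout.
    public bool TrySend(TimeSpan timeout)
    {
        return _slots.Wait(timeout);
    }

    // Models an acknowledgement from the receiver; frees one window slot.
    public void Acknowledge()
    {
        _slots.Release();
    }
}

public static class TransferWindowDemo
{
    public static void Main()
    {
        var window = new TransferWindowModel(2);
        var timeout = TimeSpan.FromMilliseconds(100);

        Console.WriteLine(window.TrySend(timeout)); // True: first slot taken
        Console.WriteLine(window.TrySend(timeout)); // True: second slot taken
        Console.WriteLine(window.TrySend(timeout)); // False: window full, send times out
        window.Acknowledge();                       // one message acknowledged
        Console.WriteLine(window.TrySend(timeout)); // True: a slot is free again
    }
}
```

In this model, if nothing ever calls `Acknowledge` (for example because no thread is free to process acknowledgements), every further `TrySend` times out, which matches the pattern of timeouts described above.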

What's going on there? Is it possible that reliable messaging confuses itself?


Solution

  • Long story short:

    It turns out that my WCF settings are just fine.

    The ThreadPool is the limiting factor. In high traffic (and therefore high load) situations I generate too many messages which have to be sent to the clients. Those are queued up as there are not enough worker threads to send the messages. At some point the queue is full - and there you are.

    For more details check this question & answer from Russ Bishop.

    Interesting detail: the fix (see the edit below) even decreased the CPU load in high traffic situations, from spiking wildly between 30 and 80 percent to an almost steady value around 30 percent. I can only assume that this is because of thread pool thread creation and cleanup.

    EDIT

    I did the following:

    // arguments: minimum worker threads, minimum I/O completion port threads
    ThreadPool.SetMinThreads(1000, 500);
    

    Those values might be like using a sledgehammer to crack a nut - but it works.
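A slightly more defensive variant of that call could look like the sketch below. It is an illustration under assumptions, not the code from the original application: the helper name `ThreadPoolTuning.EnsureMinThreads` is made up, and the right numbers depend on the workload. It reads the current minimums first so they are never lowered, and it returns the result of `ThreadPool.SetMinThreads`, which is `false` when the values are rejected (for example when a minimum exceeds the configured maximum).

```csharp
using System;
using System.Threading;

public static class ThreadPoolTuning // hypothetical helper name
{
    // Raises the thread pool minimums to at least the requested values,
    // never lowering a minimum that is already higher.
    public static bool EnsureMinThreads(int minWorker, int minIo)
    {
        int currentWorker, currentIo;
        ThreadPool.GetMinThreads(out currentWorker, out currentIo);

        int worker = Math.Max(currentWorker, minWorker);
        int io = Math.Max(currentIo, minIo);

        // Returns false if the thread pool rejects the requested values.
        return ThreadPool.SetMinThreads(worker, io);
    }
}
```

It would typically be called once at service startup, before the hosts are opened, for example `ThreadPoolTuning.EnsureMinThreads(1000, 500);`, and the return value is worth logging so a rejected setting does not go unnoticed.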