When running a relatively busy distributor with 5+ workers, the service stops processing messages from time to time.
The NServiceBus version is 3.3.8 with a few fixes backported from 4.x (MessagePropertyFilters, locking around ClearAvailabilityForWorkers, and a few minor issues). The source code is available here:
https://github.com/PeterLehmann/NServiceBus/tree/undoWfpNameChange
Analyzing a hung service with WinDbg shows that most threads are waiting on memory allocation or freeing, like this:
System.Runtime.InteropServices.GCHandle.InternalFree(IntPtr)
System.Runtime.InteropServices.GCHandle.Free()
System.Messaging.Interop.MessagePropertyVariants.Unlock()
System.Messaging.MessageQueue.ReceiveCurrent(System.TimeSpan, Int32, System.Messaging.Interop.CursorHandle, System.Messaging.MessagePropertyFilter, System.Messaging.MessageQueueTransaction, System.Messaging.MessageQueueTransactionType)
System.Messaging.MessageQueue.Peek(System.TimeSpan)
NServiceBus.Unicast.Queuing.Msmq.MsmqMessageReceiver.HasMessage()
NServiceBus.Unicast.Transport.Transactional.TransactionalTransport.HasMessage()
NServiceBus.Unicast.Transport.Transactional.TransactionalTransport.Process()
NServiceBus.Utils.WorkerThread.Loop()
System.Threading.ExecutionContext.runTryCode(System.Object)
and
System.Runtime.InteropServices.GCHandle.InternalAlloc(System.Object, System.Runtime.InteropServices.GCHandleType)
System.Runtime.InteropServices.GCHandle..ctor(System.Object, System.Runtime.InteropServices.GCHandleType)
System.Messaging.Interop.MessagePropertyVariants.Lock()
System.Messaging.MessageQueue.ReceiveCurrent(System.TimeSpan, Int32, System.Messaging.Interop.CursorHandle, System.Messaging.MessagePropertyFilter, System.Messaging.MessageQueueTransaction, System.Messaging.MessageQueueTransactionType)
System.Messaging.MessageQueue.Peek(System.TimeSpan)
NServiceBus.Unicast.Queuing.Msmq.MsmqMessageReceiver.HasMessage()
NServiceBus.Unicast.Transport.Transactional.TransactionalTransport.HasMessage()
NServiceBus.Unicast.Transport.Transactional.TransactionalTransport.Process()
NServiceBus.Utils.WorkerThread.Loop()
System.Threading.ExecutionContext.runTryCode(System.Object)
We have multiple threads with one of the above two stacks; all of them seem to be hanging, and messages aren't being processed. There is also a single thread, likewise waiting in InternalAlloc, that is in the middle of Raven communication:
System.Runtime.InteropServices.GCHandle.InternalAlloc(System.Object, System.Runtime.InteropServices.GCHandleType)
System.Runtime.InteropServices.GCHandle..ctor(System.Object, System.Runtime.InteropServices.GCHandleType)
System.Net.SafeDeleteContext.InitializeSecurityContext(System.Net.SecurDll, System.Net.SafeFreeCredentials ByRef, System.Net.SafeDeleteContext ByRef, System.String, System.Net.ContextFlags, System.Net.Endianness, System.Net.SecurityBuffer, System.Net.SecurityBuffer[], System.Net.SecurityBuffer, System.Net.ContextFlags ByRef)
System.Net.SSPIAuthType.InitializeSecurityContext(System.Net.SafeFreeCredentials, System.Net.SafeDeleteContext ByRef, System.String, System.Net.ContextFlags, System.Net.Endianness, System.Net.SecurityBuffer[], System.Net.SecurityBuffer, System.Net.ContextFlags ByRef)
System.Net.SSPIWrapper.InitializeSecurityContext(System.Net.SSPIInterface, System.Net.SafeFreeCredentials, System.Net.SafeDeleteContext ByRef, System.String, System.Net.ContextFlags, System.Net.Endianness, System.Net.SecurityBuffer[], System.Net.SecurityBuffer, System.Net.ContextFlags ByRef)
System.Net.NTAuthentication.GetOutgoingBlob(Byte[], Boolean, System.Net.SecurityStatus ByRef)
System.Net.NTAuthentication.GetOutgoingBlob(System.String)
System.Net.NegotiateClient.DoAuthenticate(System.String, System.Net.WebRequest, System.Net.ICredentials, Boolean)
System.Net.NegotiateClient.Authenticate(System.String, System.Net.WebRequest, System.Net.ICredentials)
System.Net.AuthenticationManager.Authenticate(System.String, System.Net.WebRequest, System.Net.ICredentials)
System.Net.AuthenticationState.AttemptAuthenticate(System.Net.HttpWebRequest, System.Net.ICredentials)
System.Net.HttpWebRequest.CheckResubmitForAuth()
System.Net.HttpWebRequest.CheckResubmit(System.Exception ByRef)
System.Net.HttpWebRequest.DoSubmitRequestProcessing(System.Exception ByRef)
System.Net.HttpWebRequest.ProcessResponse()
System.Net.HttpWebRequest.SetResponse(System.Net.CoreResponseData)
System.Net.ConnectStream.ProcessWriteCallDone(System.Net.ConnectionReturnResult)
System.Net.ConnectStream.CallDone(System.Net.ConnectionReturnResult)
System.Net.ConnectStream.WriteHeaders(Boolean)
System.Net.HttpWebRequest.EndSubmitRequest()
System.Net.Connection.SubmitRequest(System.Net.HttpWebRequest, Boolean)
System.Net.ServicePoint.SubmitRequest(System.Net.HttpWebRequest, System.String)
System.Net.HttpWebRequest.SubmitRequest(System.Net.ServicePoint)
System.Net.HttpWebRequest.GetResponse()
Raven.Client.Connection.HttpJsonRequest.ReadStringInternal(System.Func`1<System.Net.WebResponse>)
Raven.Client.Connection.HttpJsonRequest.ReadResponseString()
Raven.Client.Connection.HttpJsonRequest.ReadResponseJson()
Raven.Client.Connection.ServerClient.DirectCommit(System.Guid, System.String)
Raven.Client.Connection.ServerClient+<>c__DisplayClass5b.<Commit>b__5a(System.String)
Raven.Client.Connection.ServerClient.TryOperation[[System.__Canon, mscorlib]](System.Func`2<System.String,System.__Canon>, System.String, Boolean, System.__Canon ByRef)
Raven.Client.Connection.ServerClient.ExecuteWithReplication[[System.__Canon, mscorlib]](System.String, System.Func`2<System.String,System.__Canon>)
Raven.Client.Connection.ServerClient.Commit(System.Guid)
Raven.Client.Document.DocumentSession.Commit(System.Guid)
Raven.Client.Document.RavenClientEnlistment.Commit(System.Transactions.Enlistment)
System.Transactions.Oletx.OletxEnlistment.CommitRequest()
...
We have tried variations of concurrent and server-mode garbage collection. The issue appears to happen while a garbage collection is taking place, but only sometimes; at other times GC runs normally and throughput isn't affected.
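For reference, the GC modes mentioned above are normally switched in the host process's .exe.config file. This is a minimal sketch only: the values shown are one of the possible combinations, not necessarily the ones we ran with, and the file name depends on how the service is hosted.

```xml
<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <runtime>
    <!-- Server GC: one GC heap and GC thread per logical core.
         Set enabled="false" for workstation GC. -->
    <gcServer enabled="true" />
    <!-- Concurrent (background) GC; example value only.
         Set enabled="true" to turn it on. -->
    <gcConcurrent enabled="false" />
  </runtime>
</configuration>
```

The service has to be restarted for a change to these settings to take effect.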
I checked whether something was blocking the finalizer, but that doesn't seem to be the case. I also did some digging to see whether something in the memory allocation/deallocation path could deadlock, but couldn't find anything (I could dig deeper here but would like some input first).
Currently, operations restarts the services when this happens. However, even that sometimes fails: the services get stuck in the "Stopping" state and have to be killed hard before they start processing again.
So, has anyone else experienced hanging services with NServiceBus 3.3.8?
To answer my own question: we were running .NET 4.0.30319 on the system. After upgrading to .NET 4.5.1, the above problem seems to have vanished.
I suspect that updates to memory handling and garbage collection in the newer runtime have solved the issue.