Java RMI calls occasionally fail with 'Connection reset by peer' under load

I wrote a test RMI server and a test client. The server exposes a single remote method to the client.

On the client, I am using a 600-thread executor service to call the RMI method 6000 times.

On the server, each method call will create a simple task and submit it to a 300-thread executor service.

I get only one or two exceptions per run: for 6000 calls, about 1 to 3 fail. These exceptions also seem to happen only during the initial ramp-up period.

java.rmi.ConnectIOException: Exception creating connection to: ; nested exception is: 
java.net.SocketException: Connection reset by peer
at sun.rmi.transport.tcp.TCPEndpoint.newSocket(TCPEndpoint.java:631)
at sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:216)
at sun.rmi.transport.tcp.TCPChannel.newConnection(TCPChannel.java:202)
at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:129)
at java.rmi.server.RemoteObjectInvocationHandler.invokeRemoteMethod(RemoteObjectInvocationHandler.java:194)
at java.rmi.server.RemoteObjectInvocationHandler.invoke(RemoteObjectInvocationHandler.java:148)
at com.sun.proxy.$Proxy0.receiveMessage(Unknown Source)
at com.example.rmi.MsgTask.run(MsgTask.java:18)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

The client and server are running on the same machine, both launched from the same Eclipse IDE.

It looks like a request can get dropped if the RMI server program is busy for a few milliseconds when the request arrives. Is this expected? Should I treat this behaviour as normal and build a 'retry' approach into my RMI clients in future? Or can I change some settings to make sure that requests are not dropped?

Solution

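You're running into the TCP listen backlog, the queue of incoming connections that the server has not yet accepted. When it fills up, a Windows host answers further connection attempts with a reset, which the client sees as 'Connection reset by peer'. That fits the failures showing up only during the initial ramp-up, when hundreds of client threads open connections at once.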
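To show where the backlog mentioned above comes from: it is the second argument to the java.net.ServerSocket constructor, and ServerSocket(int port) uses a queue length of 50. As far as I know, RMI's default server sockets are created that way. The sketch below is a hypothetical RMIServerSocketFactory (the class name BacklogServerSocketFactory and the value 500 are my own, not from the question) that asks for a longer queue. The operating system may cap the value, and a longer queue only makes the resets rarer; it does not rule them out.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.rmi.server.RMIServerSocketFactory;

/** Hypothetical factory: creates RMI server sockets with a caller-chosen listen backlog. */
final class BacklogServerSocketFactory implements RMIServerSocketFactory {

    private final int backlog;

    BacklogServerSocketFactory(int backlog) {
        if (backlog < 1) {
            throw new IllegalArgumentException("backlog must be >= 1");
        }
        this.backlog = backlog;
    }

    @Override
    public ServerSocket createServerSocket(int port) throws IOException {
        // The second argument is the requested queue length for pending connections.
        // The OS is free to apply its own upper limit.
        return new ServerSocket(port, backlog);
    }

    // The RMIServerSocketFactory documentation recommends implementing
    // equals/hashCode so that equivalent factories are recognised as the same.
    @Override
    public boolean equals(Object other) {
        return other instanceof BacklogServerSocketFactory
                && ((BacklogServerSocketFactory) other).backlog == backlog;
    }

    @Override
    public int hashCode() {
        return backlog;
    }
}
```

Assuming the remote object is exported by hand, the factory would be passed as the last argument of UnicastRemoteObject.exportObject(Remote, int, RMIClientSocketFactory, RMIServerSocketFactory).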

The solution is either to reduce the load (for example, fewer client threads connecting at the same moment) or to retry failed calls, sleeping for a small but increasing interval between attempts.
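A minimal sketch of the retry option, assuming it is acceptable to retry only when the failure is a java.rmi.ConnectIOException, as in the stack trace in the question. That exception is raised while the connection is being made, so the call should not have reached the server yet. RetryingCaller, RemoteCall, and the parameter names are hypothetical helpers invented for this example; the doubling delay with random jitter is one common way to get a "small but increasing" sleep, not the only one.

```java
import java.rmi.ConnectIOException;
import java.rmi.RemoteException;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper: retries a remote call that failed while connecting. */
final class RetryingCaller {

    /** A remote call that may throw RemoteException (hypothetical interface). */
    @FunctionalInterface
    interface RemoteCall<T> {
        T call() throws RemoteException;
    }

    private RetryingCaller() {
    }

    /**
     * Runs the call, retrying on ConnectIOException up to maxAttempts times in total.
     * The sleep before each retry starts at initialDelayMillis and doubles each time.
     */
    static <T> T callWithRetry(RemoteCall<T> call, int maxAttempts, long initialDelayMillis)
            throws RemoteException, InterruptedException {
        if (maxAttempts < 1 || initialDelayMillis < 0) {
            throw new IllegalArgumentException("maxAttempts must be >= 1 and delay >= 0");
        }
        long delay = initialDelayMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return call.call();
            } catch (ConnectIOException e) {
                if (attempt >= maxAttempts) {
                    throw e; // out of attempts: surface the last failure
                }
                // Random jitter keeps hundreds of threads from retrying in lockstep.
                long jitter = ThreadLocalRandom.current().nextLong(delay + 1);
                Thread.sleep(delay + jitter);
                delay *= 2;
            }
        }
    }
}
```

In the client task from the question, the remote call would then be wrapped along the lines of RetryingCaller.callWithRetry(() -> stub.receiveMessage(msg), 5, 20), where stub, msg, and the numbers are placeholders. Other RemoteException subclasses are deliberately not retried here, because the call may already have run on the server.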