java linux bash marklogic marklogic-corb

CORB Job : Handle ServerConnectionException: Connection reset by peer

I am trying to run a CORB job to process my documents, but it throws the exception below after processing only part of the collection.

com.marklogic.xcc.exceptions.ServerConnectionException: Connection reset by peer
 [Session: user=<username>, cb={default} [ContentSource: <username>, cb={none} [provider: address=<xyz.com>/<IP>, pool=0/64]]]
 [Client: XCC/7.0-2, Server: XDBC/7.0-3.1]
        at com.marklogic.xcc.impl.handlers.AbstractRequestController.runRequest(AbstractRequestController.java:124)
        at com.marklogic.xcc.impl.SessionImpl.submitRequestInternal(SessionImpl.java:388)
        at com.marklogic.xcc.impl.SessionImpl.submitRequest(SessionImpl.java:371)
        at com.marklogic.developer.corb.Transform.call(Transform.java:68)
        at com.marklogic.developer.corb.Transform.call(Transform.java:1)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)

We tried increasing the thread count and the memory allocation, but to no avail.

My question is twofold:

1. What could be the root cause of this? Is there a way to solve the issue?
2. If not, is there a way to trap this exception in the shell script that runs the job?

Solution

There are many possible causes. For example it could be JVM garbage collection, a problem on the server, or even something in the network path. Blindly changing things isn't likely to help: first identify the problem, then correct it.

Most often it's the JVM and GC. MarkLogic XCC implements its own keepalive mechanism, something like HTTP 1.1 keepalive. If garbage collection takes too much time, this can lead to timeouts and resets. Monitor the JVM to see whether it is running up against its memory allocation and collecting garbage frequently before the error happens. Adding -verbosegc may also help detect this. If you think GC is the problem, try adding -Xincgc. You may also want to reduce the thread count to reduce memory pressure. You could also increase the allocation with -Xmx, but don't do that blindly, and I wouldn't go over 1 GiB.
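To judge whether GC pauses line up with the resets, it can help to summarize the verbose GC output rather than read it by eye. The following is only a rough sketch: the function name gc_summary is made up for illustration, and it assumes the classic HotSpot verbose GC line format (for example a line ending in ", 1.23 secs]"), which may differ on other JVMs or with extra GC logging flags.

```shell
#!/usr/bin/env bash
# Sketch: summarize a verbose GC log captured from the CORB client JVM.
# Assumption: lines look like the classic HotSpot -verbose:gc output,
#   [GC 1000K->500K(2000K), 0.0100000 secs]
#   [Full GC 1900K->1800K(2000K), 1.5000000 secs]
# gc_summary is a hypothetical helper name, not part of CORB or XCC.

gc_summary() {
  local log_file="$1"
  LC_ALL=C awk '
    / secs\]/ {
      # Count full collections, which are the ones most likely to stall the client.
      if ($0 ~ /Full GC/) full++
      # The pause duration is the field immediately before "secs]".
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^secs\]/ && $(i - 1) + 0 > max) max = $(i - 1) + 0
      }
    }
    END { printf "full_gc=%d longest_pause_secs=%.3f\n", full, max }
  ' "$log_file"
}
```

Run the job with verbose GC enabled, redirect its standard output to a file, and call gc_summary on that file. Many full collections, or pauses of several seconds shortly before the exception, would support the GC theory. A quiet GC log suggests looking at the server or the network instead.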

Definitely check ErrorLog.txt and also check the general server health. Is it using any swap space? Is it paging? Anything suspicious in the OS logs? How do the CPU, memory, disk, and network I/O look when the problem occurs?
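A couple of these checks are easy to script so they can be run on the MarkLogic host at the moment the job fails. This is a minimal sketch for a Linux host, not a complete health check. The function names are hypothetical, the log path /var/opt/MarkLogic/Logs/ErrorLog.txt is assumed to be the default location, and the grep pattern assumes log lines carry a level such as "Warning:" or "Error:" after the timestamp.

```shell
#!/usr/bin/env bash
# Sketch: quick server-side checks for swap use and recent log problems.
# Both function names are hypothetical helpers for illustration.

# Print the amount of swap currently in use, in kB.
# Reads SwapTotal and SwapFree from a meminfo-style file (Linux only).
swap_used_kb() {
  local meminfo="${1:-/proc/meminfo}"
  awk '
    $1 == "SwapTotal:" { total = $2 }
    $1 == "SwapFree:"  { free = $2 }
    END { printf "%d\n", total - free }
  ' "$meminfo"
}

# Print the most recent log lines at Warning level or above.
# Assumptions: the default log path, and a "<timestamp> Level: message" layout.
recent_log_problems() {
  local log_file="${1:-/var/opt/MarkLogic/Logs/ErrorLog.txt}"
  local max_lines="${2:-20}"
  grep -E ' (Warning|Error|Critical|Alert|Emergency): ' "$log_file" | tail -n "$max_lines"
}
```

A nonzero result from swap_used_kb is a hint rather than proof of a problem. Watching the si and so columns of vmstat while the job runs shows whether the host is actively paging.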

Every once in a while this kind of thing will turn out to be a firewall or router that doesn't like long-lived connections and shuts them down. If possible arrange it so that your client and server are on the same subnet, with nothing in between except a relatively dumb hub or switch. If there's a local firewall on either host, make sure it won't interfere.
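As for trapping the failure in the calling shell script: I would not rely on the exit status alone, so one option is to capture the job's output and look for the exception name. The sketch below makes that assumption explicit. The function names and the return codes 1 and 2 are my own conventions for illustration, not anything CORB defines.

```shell
#!/usr/bin/env bash
# Sketch: detect ServerConnectionException in a job's output and retry.
# run_and_check and run_with_retry are hypothetical helper names.

# Run a command with all output captured in a log file.
# Returns 0 on a clean run, 2 if the exception text appears in the log,
# and 1 for any other nonzero exit status.
run_and_check() {
  local log_file="$1"; shift
  local status=0
  "$@" >"$log_file" 2>&1 || status=$?
  if grep -q 'ServerConnectionException' "$log_file"; then
    return 2
  fi
  [ "$status" -eq 0 ] || return 1
  return 0
}

# Retry the command up to max_attempts times, but only when the
# failure was a connection exception. Other failures stop immediately.
run_with_retry() {
  local max_attempts="$1" log_file="$2"; shift 2
  local attempt rc
  for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
    rc=0
    run_and_check "$log_file" "$@" || rc=$?
    case "$rc" in
      0) return 0 ;;
      2) echo "attempt $attempt: connection exception seen in $log_file" >&2 ;;
      *) return "$rc" ;;
    esac
  done
  return 2
}
```

You would call it as run_with_retry 3 corb.log followed by your existing java command line. Keep in mind that a retry starts the whole job again, so this is only safe if your transform module can be applied to the same document more than once. It also treats the symptom, so it is worth finding the actual cause first.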