jboss elasticsearch lucene jvm jvm-hotspot

JVM crashes frequently

The JVM crashes unexpectedly and frequently in our production environment, which takes JBoss (EAP 6.3) down. We have Java 7u72 installed.

The crash logs all show the same output, where the current thread is:

Current thread (0x00000000d1d99000): JavaThread "Lucene Merge Thread #0" daemon [_thread_in_Java, id=1144, stack(0x00000000f6a00000,0x00000000f6b00000)]

and the rest of the log is full of entries like:

JavaThread "elasticsearch[Node BD852E44][search][T#68]" daemon [_thread_blocked, id=14396, stack(0x00000000f7b30000,0x00000000f7c30000)]

As far as I understand, Elasticsearch is related to indexing and uses Lucene under the hood, but we have a number of applications deployed. How can we check which one is involved? Can someone please help? The complete crash logs are at: http://pastebin.com/845LU9iK

Solution

It looks like the JVM didn't manage to record stack traces for the affected thread. If that's the same for all crashes, then it doesn't seem to match any known Lucene or JBoss bugs.

#  guarantee(result == EXCEPTION_CONTINUE_EXECUTION) failed: Unexpected result from topLevelExceptionFilter

AIUI this indicates an error in native exception handling, so it's one error masking another, probably making this crash log fairly useless.

So I can only provide really generic advice:

- You're using an older JVM version. Update to the latest Java 7, Java 8, or possibly even a Java 9 dev build and see if the crash goes away. Even if the newer versions still crash, they might produce different or more useful error reports.
- To diagnose potential compiler bugs, you can try running with the following flags:
  - -XX:-TieredCompilation ¹ should disable the C1 compiler
  - -XX:+TieredCompilation -XX:TieredStopAtLevel=1 should disable the C2 compiler
  - -Xint disables all JIT compilation, which is very slow
- Ask on the hotspot-dev mailing list for further guidance.

¹: Tiered compilation is a new Java 7 feature. It basically combines the interpreter and the C1 and C2 JIT compilers (which were formerly used separately in the client and server VMs) into different optimization stages.

Each of them can have optimization bugs. Turning off individual stages helps isolate each one as a potential cause.
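When experimenting with these flags on an application server, it is easy to edit the wrong startup script and end up testing an unchanged JVM. The following is a minimal sketch, not part of the original setup, that prints the arguments the running JVM actually received and maps them to the three combinations listed above. The class name JitFlagCheck and the describe helper are hypothetical; the only JDK call used is RuntimeMXBean.getInputArguments(). It does not model JVM defaults or the fact that a later flag can override an earlier one.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

public class JitFlagCheck {

    // Maps a list of JVM arguments to one of the flag combinations from the answer.
    // Assumption: flags are spelled exactly as shown there; ordering is ignored.
    static String describe(List<String> jvmArgs) {
        if (jvmArgs.contains("-Xint")) {
            return "interpreter only (no JIT)";
        }
        if (jvmArgs.contains("-XX:+TieredCompilation")
                && jvmArgs.contains("-XX:TieredStopAtLevel=1")) {
            return "C1 only (C2 disabled)";
        }
        if (jvmArgs.contains("-XX:-TieredCompilation")) {
            return "tiered compilation off (C1 disabled)";
        }
        return "none of the diagnostic flags set";
    }

    public static void main(String[] args) {
        // Arguments passed to this JVM on startup, excluding main() arguments.
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("JVM arguments: " + jvmArgs);
        System.out.println("JIT setup: " + describe(jvmArgs));
    }
}
```

Running it with the same options as the server, for example java -XX:-TieredCompilation JitFlagCheck, should report that tiered compilation is off. The same getInputArguments() call can be logged from inside a deployed application to check the server's own JVM.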

Edit: The new crash report is more useful since it at least has Java frames. The interesting part is the following:

J 1559  sun.misc.Unsafe.getByte(J)B (0 bytes) @ 0x000000000178e99b [0x000000000178e960+0x3b]
j  java.nio.DirectByteBuffer.get()B+11
j  org.apache.lucene.store.ByteBufferIndexInput.readByte()B+4
J 9447 C2 org.apache.lucene.store.DataInput.readVInt()I (114 bytes) @ 0x000000000348cc00 [0x000000000348cbc0+0x40]

DataInput.readVInt seems to be an ongoing source of grief; see this SO answer for possible solutions.
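In HotSpot crash reports, a frame line starting with uppercase J is JIT-compiled Java code and one starting with lowercase j is interpreted. To compare several crash reports and see whether the same compiled method keeps showing up, the frame lines can be classified mechanically. This is a rough sketch under the assumption that frame lines look like the ones quoted above; the class name FrameKind and its regular expressions are illustrative, not a full parser for the crash log format.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FrameKind {

    // Compiled frame: "J <compile id> [C1|C2] <method> ..."; the compiler tag is optional.
    private static final Pattern COMPILED =
            Pattern.compile("^J\\s+\\d+\\s+(?:(C1|C2)\\s+)?(\\S+)");
    // Interpreted frame: "j  <method>+<bytecode index>"
    private static final Pattern INTERPRETED = Pattern.compile("^j\\s+(\\S+)");

    // Returns a short description of one frame line from the crash report.
    static String classify(String line) {
        String s = line.trim();
        Matcher m = COMPILED.matcher(s);
        if (m.find()) {
            String compiler = (m.group(1) == null) ? "compiler not stated" : m.group(1);
            return "compiled (" + compiler + "): " + m.group(2);
        }
        m = INTERPRETED.matcher(s);
        if (m.find()) {
            return "interpreted: " + m.group(1);
        }
        return "other: " + s;
    }

    public static void main(String[] args) {
        String[] frames = {
            "J 1559  sun.misc.Unsafe.getByte(J)B (0 bytes) @ 0x000000000178e99b [0x000000000178e960+0x3b]",
            "j  java.nio.DirectByteBuffer.get()B+11",
            "j  org.apache.lucene.store.ByteBufferIndexInput.readByte()B+4",
            "J 9447 C2 org.apache.lucene.store.DataInput.readVInt()I (114 bytes) @ 0x000000000348cc00 [0x000000000348cbc0+0x40]"
        };
        for (String frame : frames) {
            System.out.println(classify(frame));
        }
    }
}
```

For the four frames above, only the readVInt line is reported as compiled by C2, which is why it stands out as a candidate for the compiler experiments described earlier.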