I'm running neo4j 2.3.0-RC1 embedded, using the Java API. It keeps crashing without warning, and I'm trying to figure out why.
I was previously using this code just fine with 1.9.8. Upgrading to 2.0+ required adding transactions, altering some Cypher syntax, changing the boot-time Spring config, and a handful of other changes.
The vast majority of the code remains the same, and is functionally correct as confirmed by unit and integration tests.
When the engine is booted, it is adding new nodes on a fairly constant basis. The output below shows a mysterious crash after 290 minutes.
This happens every time, sometimes after 2 hours and sometimes after 5. It never happened at all with 1.9.8.
The JVM is run with ./start-engine.sh > console.out 2>&1 &.
The operative line of start-engine.sh is:
$JAVA_HOME/bin/java -server $JAVA_OPTIONS $JPROFILER_OPTIONS -cp '.:lib/*' package.engine.Main $*
Below are the last few lines of console.out.
17437.902: RevokeBias [ 112 6 5 ] [ 20 6 27 43 26 ] 1
17438.020: RevokeBias [ 112 3 9 ] [ 5 0 5 0 0 ] 3
17438.338: GenCollectForAllocation [ 113 2 2 ] [ 1 0 11 4 32 ] 2
17438.857: BulkRevokeBias [ 112 3 13 ] [ 0 0 28 6 2 ] 3
./start-engine.sh: line 17: 19647 Killed $JAVA_HOME/bin/java -server $JAVA_OPTIONS $JPROFILER_OPTIONS -cp '.:lib/*' package.engine.Main $*
There is no stacktrace, and no other error output.
These are the last few lines of messages.log from /mnt/engine-data:
2015-10-30 18:07:44.457+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Check Pointing triggered by scheduler for time threshold [845664646]: Starting check pointing...
2015-10-30 18:07:44.458+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Check Pointing triggered by scheduler for time threshold [845664646]: Starting store flush...
2015-10-30 18:07:44.564+0000 INFO [o.n.k.i.s.c.CountsTracker] About to rotate counts store at transaction 845664650 to [/mnt/engine-data/neostore.counts.db.b], from [/mnt/engine-data/neostore.counts.db.a].
2015-10-30 18:07:44.565+0000 INFO [o.n.k.i.s.c.CountsTracker] Successfully rotated counts store at transaction 845664650 to [/mnt/engine-data/neostore.counts.db.b], from [/mnt/engine-data/neostore.counts.db.a].
2015-10-30 18:07:44.834+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Check Pointing triggered by scheduler for time threshold [845664646]: Store flush completed
2015-10-30 18:07:44.835+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Check Pointing triggered by scheduler for time threshold [845664646]: Starting appending check point entry into the tx log...
2015-10-30 18:07:44.836+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Check Pointing triggered by scheduler for time threshold [845664646]: Appending check point entry into the tx log completed
2015-10-30 18:07:44.836+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Check Pointing triggered by scheduler for time threshold [845664646]: Check pointing completed
2015-10-30 18:07:44.836+0000 INFO [o.n.k.i.t.l.p.LogPruningImpl] Log Rotation [35826]: Starting log pruning.
2015-10-30 18:07:44.844+0000 INFO [o.n.k.i.t.l.p.LogPruningImpl] Log Rotation [35826]: Log pruning complete.
So everything looks fine right up until the moment of the crash, and the crash comes as a complete surprise.
There is plenty of other data in messages.log, but I don't know what I'm looking for.
$ java -version
java version "1.7.0_65"
Java(TM) SE Runtime Environment (build 1.7.0_65-b17)
Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)
$ uname -a
Linux 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2 22:08:27 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
You may be seeing the effects of the Linux Out-of-Memory (OOM) Killer, which kills processes when the system is running critically low on physical memory. Because the kill comes from the kernel rather than from inside the JVM, that would also explain why there is no stack trace or other error output in your logs.
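One quick way to confirm this is to check the kernel log right after a crash; when the OOM Killer fires, it records the victim there. The exact log file varies by distribution, but on Ubuntu something like the following should show it:
dmesg | grep -iE 'out of memory|killed process'
grep -i 'killed process' /var/log/syslog
If those turn up your java PID (19647 in the output above), the OOM Killer is the culprit.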
To quote this excellent article:
Because many applications allocate their memory up front and often don't utilize the memory allocated, the kernel was designed with the ability to over-commit memory to make memory usage more efficient. […] When too many applications start utilizing the memory they were allocated, the over-commit model sometimes becomes problematic and the kernel must start killing processes …
The above-quoted article is a great resource for learning about the OOM Killer, and includes a lot of information on how to troubleshoot and configure Linux to try to avoid the problem.
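As a rough illustration of the kind of tuning that article walks through (these are generic Linux knobs, not neo4j-specific settings, and the values are placeholders to adapt):
# Make this particular JVM a less attractive victim (valid range is -1000 to 1000;
# -1000 exempts it entirely). Assumes a single matching process, and must be
# re-applied after every restart of the engine.
echo -500 | sudo tee /proc/$(pgrep -f package.engine.Main)/oom_score_adj
# Or disable heuristic over-commit system-wide; note that allocations can then
# fail outright with ENOMEM instead of the kernel killing a process later.
sudo sysctl -w vm.overcommit_memory=2
sudo sysctl -w vm.overcommit_ratio=80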
And to quote the answer to this question:
The OOM Killer has to select the best process to kill. Best here refers to that process which will free up maximum memory upon killing and is also least important to the system.
Since it is very possible that the neo4j process is the most memory-intensive process on your system, it makes sense that it would be killed when physical resources start running out.
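If you want to see how the kernel currently ranks candidates, every process exposes an oom_score under /proc (higher means more likely to be killed); a quick sketch that lists the top scorers:
# highest OOM scores first: score, PID, command line
for p in /proc/[0-9]*; do
  printf '%6s %6s %s\n' "$(cat $p/oom_score 2>/dev/null)" "${p#/proc/}" \
         "$(tr '\0' ' ' < $p/cmdline 2>/dev/null | cut -c1-60)"
done | sort -rn | head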
One way to avoid the OOM Killer is to try to keep other memory-intensive processes off of the same system. That should make over-commitment of memory much less likely. But you should at least read the first article above, to understand the OOM Killer better -- there is a lot to know.
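Another angle, since the Killed line in console.out shows the JVM itself was the chosen victim, is to bound the engine's own footprint so that heap plus off-heap memory stays comfortably below physical RAM (in 2.3 the store page cache is sized separately, e.g. via the dbms.pagecache.memory setting). A minimal sketch for start-engine.sh, with placeholder sizes:
# Illustrative caps only; size them so the JVM's total footprint leaves headroom
# for the OS page cache and any other processes on the box.
JAVA_OPTIONS="-Xms4g -Xmx4g -XX:MaxDirectMemorySize=2g"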