I use the heart program to restart an Erlang node when it becomes unresponsive. However, I am finding it hard to understand why the node freezes. SASL logs don't show any errors, and my own logs don't seem to show anything remarkable happening at those times. Can anybody give advice on debugging this sort of thing?
You could try to call erlang:halt/1
from your HEART_COMMAND
thus creating a crash dump from the unresponsive node.
You can try using the erl_call
tool with e.g. -a erlang halt 123
.
If the erlang node can't respond to this is also interesting information.
Did you try increasing `HEART_BEAT_TIMEOUT? Maybe the node is just bogged down a bit an misses the timeout but doesn't freeze.