
Debugging Erlang heart timeouts


I use the heart program to restart an Erlang node when it becomes unresponsive. However, I am finding it hard to understand why the node freezes. SASL logs don't show any errors, and my own logs don't seem to show anything remarkable happening at those times. Can anybody give advice on debugging this sort of thing?


Solution

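  • You could try calling erlang:halt/1 from your HEART_COMMAND, thus creating a crash dump from the unresponsive node.

    You can try using the erl_call tool for this, e.g. with -a 'erlang halt [123]' (erl_call takes the arguments as an Erlang list). Note that an integer such as 123 only sets the exit status; erlang:halt/1 writes a crash dump when given a string, which becomes the dump's slogan.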

    If the Erlang node can't respond to this, that is also interesting information.

    Did you try increasing HEART_BEAT_TIMEOUT (the default is 60 seconds)? Maybe the node is just bogged down a bit and misses the timeout but doesn't actually freeze.
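The suggestions above could be wired together roughly like this. It is only a sketch under assumptions: the node name mynode, the log path /tmp/heart_debug.log, and the helper names start_node and request_crash_dump are made up for illustration, and it assumes erl and erl_call are on the PATH and that the default cookie is in use. Whether the node is still running when heart invokes HEART_COMMAND depends on the OTP version and heart settings, so the erl_call step may simply fail; the script records the outcome either way.

```shell
#!/bin/sh
# Sketch only: node name, paths and function names are hypothetical.
NODE_NAME="${NODE_NAME:-mynode}"              # short node name (assumed)
LOG_FILE="${LOG_FILE:-/tmp/heart_debug.log}"  # where to note what happened
ERL="${ERL:-erl}"                             # overridable, e.g. for testing
ERL_CALL="${ERL_CALL:-erl_call}"

# Start the node under heart.
#   $1: heartbeat timeout in seconds (heart's default is 60)
#   $2: command heart should run when it restarts the node
start_node() {
    HEART_BEAT_TIMEOUT="${1:-120}" HEART_COMMAND="$2" \
        "$ERL" -heart -detached -sname "$NODE_NAME"
}

# Ask the node to halt with a crash dump. The string argument becomes the
# crash dump slogan; an integer would only set the exit status.
# A node that halts may never reply, so the exit status is just a hint:
# also check whether a crash dump file actually appeared.
request_crash_dump() {
    "$ERL_CALL" -sname "$NODE_NAME" -a 'erlang halt ["heart_timeout"]'
    status=$?
    echo "$(date) erl_call to $NODE_NAME: exit status $status" >> "$LOG_FILE"
}

# Dispatch only when called with an argument, so the file can be sourced.
case "${1:-}" in
    start)   start_node 120 "$0 restart" ;;
    restart) request_crash_dump; start_node 120 "$0 restart" ;;
esac
```

Saved under some name of your choosing and invoked by absolute path with the argument start, this launches the node with a 120 second timeout and registers itself as the HEART_COMMAND. On a restart it first tries the erl_call halt request and appends the result to the log, so you can tell afterwards whether the node was still able to answer.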