How to tune Apache Ignite failure detection threshold?


Obviously there's a trade-off between responsiveness and stability - what would be a good guideline here for tuning? In particular, I can imagine there's a big difference between non-virtualized, private virtualized and public virtualized environments in terms of what's achievable. The default of 10 seconds seems quite long to me.


Solution

  • 10 seconds is about the smallest realistic value. A GC pause or network glitch can easily last longer than 10 seconds and get your node segmented from the cluster.

    Please note that hitting the failure detection timeout very often means downtime on the scale of many minutes if more than one node is considered down. Given that severity, it makes sense to wait longer than 10 seconds.

    I would recommend 30 seconds for that value.
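
    As a rough illustration, the timeout could be raised programmatically along these lines. This is only a sketch: it assumes Ignite 2.x on the classpath and a node started with default discovery. The class name FailureDetectionExample is made up for the example, and the client timeout line is optional (that setting is separate from the server-side one).

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

// Hypothetical example class; only the two setters matter here.
public class FailureDetectionExample {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Failure detection timeout for server nodes, in milliseconds
        // (30 seconds, as recommended above; the default is 10 seconds).
        cfg.setFailureDetectionTimeout(30_000L);

        // Optional: the equivalent timeout applied to client nodes.
        cfg.setClientFailureDetectionTimeout(30_000L);

        // Start a node with this configuration and print the effective value.
        try (Ignite ignite = Ignition.start(cfg)) {
            System.out.println("Failure detection timeout (ms): "
                + ignite.configuration().getFailureDetectionTimeout());
        }
    }
}
```

    The same value has to be applied on every node's configuration; whether 30 seconds is right for a given environment still depends on the GC pauses and network latency actually observed there.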