
Chronicle Queue hard power failure recovery


When writing to Chronicle Queue, the default write doesn't flush to disk, so I believe anything that is still in the Linux kernel's dirty page cache is lost on power failure. What's the best approach to get guaranteed recovery in the event of power failure? Would a battery-backed RAID array along with enforced flush on write be a good approach? Or is it better to use replication with an ack from the second machine before assuming the write is safely recorded? Which of these approaches would have the best performance? Theoretically the power failure could affect both machines if they are on the same power grid.


Solution

  • anything that is in the Linux kernel dirty page cache is lost.

    Yes

    What's the best approach to get guaranteed recovery in the event of power failure?

    Replicate the data to a second or third machine. That way even if the whole machine/data centre can't be recovered you can continue operation without data loss.

    Would a battery-backed RAID array along with enforced flush on write be a good approach?

    You have to trust the reliability of the hardware, which is something Chronicle can't guarantee and something many of our clients have been burnt by before.
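
    To make "enforced flush on write" concrete, here is a minimal sketch using plain Java NIO. It is not the Chronicle Queue API; the class name `ForcedFlushSketch` and the method `appendAndForce` are made up for illustration. It appends a record and then calls `FileChannel.force(true)`, which asks the OS to push the file's content and metadata to the storage device before returning.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration only: plain JDK file I/O, not Chronicle Queue.
final class ForcedFlushSketch {

    /** Appends the message, forces it to the device, and returns the file size in bytes. */
    static long appendAndForce(Path file, String message) {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.APPEND)) {
            ByteBuffer buf = ByteBuffer.wrap(message.getBytes(StandardCharsets.UTF_8));
            while (buf.hasRemaining()) {
                ch.write(buf);          // lands in the page cache first
            }
            ch.force(true);             // blocks until the OS reports the data as written out
            return ch.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("journal", ".log");
        long size = appendAndForce(file, "order-1\n");
        System.out.println("bytes in file after forced write: " + size);
        Files.delete(file);
    }
}
```

    Every forced write pays the device's flush latency, and the guarantee only holds if the controller and disks honour the flush, which is exactly the hardware trust issue described above.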

    Or is it better to use replication with an ack from the second machine before assuming the write is safely recorded?

    It depends on your requirements. This is best practice in our opinion, though many clients don't feel they need this option.
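
    The "ack before assuming the write is safe" pattern can be sketched as follows. This is a toy model in plain Java, not Chronicle's replication API: `ReplicaLink`, `writeAndAwaitAck` and the in-memory `StringBuilder` log are all assumptions made for illustration. The writer appends locally, then treats the message as safely recorded only if the replica acknowledges it within a timeout.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration only: models the ack handshake, not a real network link.
final class AckBeforeProcessSketch {

    /** Assumed replication link: the future completes once the replica has the message. */
    interface ReplicaLink {
        CompletableFuture<Void> send(String message);
    }

    /** Returns true only if the replica acknowledged the message within the timeout. */
    static boolean writeAndAwaitAck(StringBuilder localLog, ReplicaLink replica,
                                    String message, long timeoutMillis) {
        localLog.append(message).append('\n');   // stands in for the local queue append
        try {
            replica.send(message).get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;                          // second machine has a copy
        } catch (TimeoutException | ExecutionException e) {
            return false;                         // no ack: do not treat as safely recorded
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();   // preserve the interrupt for the caller
            return false;
        }
    }

    public static void main(String[] args) {
        StringBuilder local = new StringBuilder();
        ReplicaLink healthy = m -> CompletableFuture.completedFuture(null);
        ReplicaLink silent = m -> new CompletableFuture<>();   // never acknowledges

        System.out.println(writeAndAwaitAck(local, healthy, "order-1", 100)); // true
        System.out.println(writeAndAwaitAck(local, silent, "order-2", 100));  // false
    }
}
```

    The cost of this approach is that every message waits for a network round trip before it can be processed.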

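    Another approach is to replicate the data to a secondary machine and have the secondary process the data. This can halve the network latency introduced, because the data only has to cross the network once instead of waiting for an ack to make the return trip.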

    Which of these approaches would have the best performance?

    The best performance comes from assuming a manual process will be used in the event of a failure and being willing to accept a small loss of data. In this case, you process everything as soon as possible.

    Note: There are some alternatives.

    • You can wait for an ack for only the critical messages; other message types can be processed immediately.
    • You can allow a window in which you keep processing messages as long as no more than N are unacknowledged.
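
    Both alternatives can be combined in one small gate. The sketch below is an assumption about how such a policy might look, not something Chronicle provides: the class `AckWindowSketch`, its methods and the sequence numbering are all hypothetical. Critical messages may be processed only once acknowledged; other messages may be processed while at most N messages up to and including them are unacknowledged.

```java
// Hypothetical illustration only: a bookkeeping gate, with no real replication behind it.
final class AckWindowSketch {
    private final int maxUnacked;     // N: how far processing may run ahead of acks
    private long lastWritten = -1;    // highest sequence written locally
    private long lastAcked = -1;      // highest sequence acknowledged by the replica

    AckWindowSketch(int maxUnacked) {
        if (maxUnacked < 0) {
            throw new IllegalArgumentException("maxUnacked must be >= 0");
        }
        this.maxUnacked = maxUnacked;
    }

    /** Records a local write and returns its sequence number (0, 1, 2, ...). */
    synchronized long recordWrite() {
        return ++lastWritten;
    }

    /** Records that the replica has everything up to and including seq. */
    synchronized void recordAck(long seq) {
        lastAcked = Math.max(lastAcked, Math.min(seq, lastWritten));
    }

    /** Critical messages need an ack; others may run up to N messages ahead of the acks. */
    synchronized boolean mayProcess(long seq, boolean critical) {
        if (seq > lastWritten) {
            return false;             // not written yet
        }
        long allowedAhead = critical ? 0 : maxUnacked;
        return seq - lastAcked <= allowedAhead;
    }

    public static void main(String[] args) {
        AckWindowSketch window = new AckWindowSketch(2);
        long a = window.recordWrite();
        long b = window.recordWrite();
        long c = window.recordWrite();

        System.out.println(window.mayProcess(b, false)); // true: 2 unacked, within N = 2
        System.out.println(window.mayProcess(c, false)); // false: 3 unacked, over N = 2
        System.out.println(window.mayProcess(a, true));  // false: critical and not yet acked

        window.recordAck(a);
        System.out.println(window.mayProcess(c, false)); // true: now only 2 unacked
        System.out.println(window.mayProcess(a, true));  // true: acked
    }
}
```

    With N = 0 this degrades to waiting for an ack on every message, and a large N approaches the "process everything as soon as possible" case; the value of N bounds how many messages could be lost.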

    Theoretically the power failure could affect both machines if they are on the same power grid.

    This is where 2+1 replication might be an option: one backup server nearby to recover normal operation in the event of the failure of a rack or part of one, and a second backup off site, which is slower to replicate to but has far less chance of failing at the same time.