Tags: storage, distributed, distributed-computing

What happens when all replicas of a piece of data fail in a data-center?


Distributed storage architectures in modern data centers typically keep 2-3 replicas of each piece of data, so that it remains available when a machine fails.

As I understand it, there is still a non-zero probability of all replicas failing, and given the scale of operations, there must be instances where this actually happens. How do large data centers protect against this kind of failure, especially for important data such as your email or photos? Adding further redundancy can only make such failures less likely, not impossible.
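
To make the concern concrete, here is a back-of-envelope sketch in Python. The per-replica loss probability and the object count are numbers I made up purely for illustration; the point is only that replication drives the per-object loss probability down while the fleet-wide expected number of losses stays non-zero at sufficient scale.

```python
# Back-of-envelope sketch with made-up numbers (not real data-center figures):
# assume each replica of an object is lost independently with probability p
# before it can be re-replicated; the object is lost only if all n replicas fail.

def p_object_lost(p_replica: float, n_replicas: int) -> float:
    """Probability that every replica of a single object fails (independence assumed)."""
    return p_replica ** n_replicas

def expected_lost_objects(p_replica: float, n_replicas: int, n_objects: int) -> float:
    """Expected number of objects lost across a fleet storing n_objects."""
    return n_objects * p_object_lost(p_replica, n_replicas)

if __name__ == "__main__":
    p = 1e-4  # hypothetical chance a replica is gone before repair finishes
    for n in (1, 2, 3):
        print(f"replicas={n}  P(object lost)={p_object_lost(p, n):.1e}  "
              f"expected losses over 1e12 objects={expected_lost_objects(p, n, 10**12):.2f}")
```

With these invented figures, three replicas still leave an expected loss of about one object per trillion stored, which is why I suspect such failures must occasionally occur in practice.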


Solution

  • NYC Tech Talk Series: How Google Backs Up the Internet is a good explanation of how Google manages backups and achieves reliability. A text-based explanation is here.

    Most importantly, the talk makes the following points:

    • Redundancy is not a guarantee of integrity or recoverability.
    • Tape is not obsolete.
    • Isolation needs to be ensured along several different dimensions: location, application-layer problems, storage-layer problems, media failure, etc.
    • Back up and restore continuously, reading from and writing to tapes even before a restore is actually needed.
    • Automate steady-state operations as much as possible.
    • Expect failures at a particular rate; investigate when that rate changes (a minimal sketch of such a check closes this answer).

    Again, as the other answer says, all you can do is cover every base, make the probability of total loss extremely low, and keep the window of data loss (between one backup failing and being rebuilt from the other backups) as short as possible; a rough sketch of that window follows.
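
    As a rough illustration of that window, here is a minimal sketch in Python. It assumes exponentially distributed, independent disk lifetimes, and the repair time and MTTF figures are invented for illustration rather than taken from the talk:

    ```python
    import math

    def p_loss_during_rebuild(n_replicas: int, repair_hours: float, mttf_hours: float) -> float:
        """After one replica dies, probability that the remaining replicas all die
        before the repair finishes (exponential lifetimes, independence assumed)."""
        p_one_dies_in_window = 1.0 - math.exp(-repair_hours / mttf_hours)
        return p_one_dies_in_window ** (n_replicas - 1)

    # Hypothetical numbers: 3 replicas, ~100,000 h mean time to failure per disk.
    print(p_loss_during_rebuild(3, repair_hours=6, mttf_hours=100_000))   # fast repair
    print(p_loss_during_rebuild(3, repair_hours=48, mttf_hours=100_000))  # slow repair, much worse
    ```

    Shortening the repair window is exactly what the "continuous backup and restoration" point above buys: the faster a lost copy is rebuilt, the less likely a coincident second and third failure becomes.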
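
    The last bullet, treating a change in the failure rate itself as the alarm signal, can also be sketched. The baseline daily rate, the observed count, and the Poisson-process model below are my own illustrative assumptions, not figures from the talk:

    ```python
    import math

    def poisson_tail(expected: float, observed: int) -> float:
        """P(X >= observed) when failures arrive as a Poisson process with the
        expected daily rate: how surprising today's count is."""
        p_below = sum(math.exp(-expected) * expected ** k / math.factorial(k)
                      for k in range(observed))
        return 1.0 - p_below

    # Hypothetical figures: we normally see ~4 drive failures/day in this cluster.
    expected_per_day = 4.0
    observed_today = 12

    p = poisson_tail(expected_per_day, observed_today)
    print(f"P(>= {observed_today} failures | baseline {expected_per_day}/day) = {p:.4f}")
    if p < 1e-3:
        print("Failure rate well above baseline - investigate before redundancy erodes.")
    ```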