Tags: fault-tolerance, hp-nonstop, tandem

How does HP/Tandem NonStop achieve single failure FT without spares?


As far as I could gather from Wikipedia and the mind-boggling HPE website, the claim to fame of the NonStop system architecture is that it achieves single-failure fault tolerance without having to allocate excessive amounts of spare capacity (e.g. in a lockstepped architecture you would typically need to overprovision by 3x).

This seems like a desirable property, yet I couldn't find more details about the approach they use or its caveats: what assumptions do they make about the network, what kinds of failures do they tolerate, what client behavior is assumed, what recovery time is acceptable, what workloads do they run, etc.?

Could anybody describe in brief how the NonStop system solves the typical problems of failure detection and failure correction? Is it a generic, magical solution at the system level, or does it require that applications be written to use certain transaction facilities and to checkpoint their data and communications?

Thanks a lot!


Solution

  • I developed apps on Tandem systems in the 1980s and early 1990s. The Tandem NonStop hardware was essentially tightly multiplexed processors and storage in a single chassis. This could be (and often was) scaled up by adding more chassis. The hardware approach is quite interesting in itself; see, for example, https://www.hpl.hp.com/techreports/tandem/TR-86.2.pdf

    However, application-level fault tolerance relied on an API for certain Guardian operating system services, which had to be configured and invoked from application code. Roughly: when an app started on the primary machine, it would request creation of a hot-standby backup process on the second machine. Once all this was spun up, the primary app would call the API to take a checkpoint, which would copy the entire state of the primary's process space to the secondary process. Disk was shared and replicated - a forerunner of today's commonplace RAID. The secondary's OS instance watched for a heartbeat from the primary; if it didn't hear one, it would take over, assuming that the primary's hardware or software (or both) had failed. This thumbnail isn't the whole story, of course. For details see "Software" in https://pages.cs.wisc.edu/~remzi/Classes/739/Fall2015/Papers/tandem-TR-90.pdf
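    The process-pair pattern described above (checkpoint, heartbeat, takeover) can be sketched in a few lines of Python. This is not the Guardian API - all names, timeouts, and the use of in-process queues in place of Tandem's interprocessor messages are illustrative assumptions; it just shows the shape of the primary/backup interaction:

    ```python
    import queue
    import threading
    import time

    HEARTBEAT_INTERVAL = 0.05  # how often the primary sends a heartbeat (illustrative)
    TAKEOVER_TIMEOUT = 0.25    # backup assumes the primary is dead after this much silence

    def primary(chan, steps_before_crash):
        """Primary: does application work, checkpoints its state to the
        backup, and sends heartbeats. Goes silent to simulate a crash."""
        counter = 0
        for _ in range(steps_before_crash):
            counter += 1                       # one unit of application work
            chan.put(("checkpoint", counter))  # copy current state to the backup
            chan.put(("heartbeat", None))
            time.sleep(HEARTBEAT_INTERVAL)
        # "crash": the primary simply stops sending anything

    def backup(chan, result):
        """Backup: passively applies checkpoints; when heartbeats stop,
        it takes over from the last checkpointed state."""
        counter = 0
        while True:
            try:
                kind, payload = chan.get(timeout=TAKEOVER_TIMEOUT)
            except queue.Empty:
                # no message within the timeout: declare the primary dead
                counter += 1          # take over and resume the work
                result.put(counter)
                return
            if kind == "checkpoint":
                counter = payload     # adopt the primary's latest state

    def run_pair(steps_before_crash=3):
        chan, result = queue.Queue(), queue.Queue()
        b = threading.Thread(target=backup, args=(chan, result))
        p = threading.Thread(target=primary, args=(chan, steps_before_crash))
        b.start()
        p.start()
        p.join()
        b.join()
        return result.get()

    if __name__ == "__main__":
        # primary dies after step 3; backup resumes and completes step 4
        print("backup resumed at step", run_pair())
    ```

    The real system was far more involved (the checkpoint copied process space, the disk state was mirrored, and the takeover had to cope with in-flight transactions), but this captures why the app - not just the OS - had to decide when and what to checkpoint.
    
    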

    This was how ultra-high availability and scalability was achieved for apps running on Tandem stacks.

    But using Guardian services was not easy. In addition to developing app feature code, it required understanding how each Guardian API call supported the fault-tolerance strategy, then thinking through when and what to checkpoint, and how to deal with restart edge cases. Devising adequate test suites was a puzzle. All this increased the time, cost, and difficulty of development and testing. A lot of shops thought they were buying all this out of the box (Tandem systems were significantly more expensive than other mid-range systems), and then realized it was too heavy a lift to actually develop NonStop code. As a result, only about 20% of Tandem app code used these capabilities (my best recollection of a study at the time). The rest simply used the dual stacks in simplex mode.

    I haven't yet seen anything that reaches the elegance and effectiveness of the Tandem HW/SW architecture, including their container approach ;-)