java, cluster-computing, fault-tolerance

How can I make some parts of my application not dependent on failures of another part?


Suppose we have several services that fetch data from different sources and store it in some predefined format. They might store the fetched data in a database, in a file, or somewhere else. The point is that all of these services are very similar; they differ only in the sources they use.

Previously, these services were split across several separate Java applications.

Now we want to merge these services into one application to share the source code and make things simpler.

The question is: how can we guarantee that one service's failure will never affect another one?

I see several possible ways:

  1. Run all tasks in separate threads, and do not share any common resource that one task could lock. Cons: memory issues are not mitigated.

  2. Run all tasks in separate JVMs. All risks are mitigated, but it's more complicated and requires more configuration of the host.

  3. Run all tasks on different nodes of a cluster. This is the most reliable way, but also the most costly in resources and programmer time.

Any more ideas and suggestions?


Solution

  • How can we guarantee that one service's failure will never affect another one?

    You can't. Certainly not with a hard guarantee, and not for all possible failure modes.

    For example, if one possible failure mode is for a task to go into an infinite loop (or take a finite but very long time), then that is going to affect other tasks, unless you can afford to dedicate an independent computer (or more realistically, JVM) to each task.
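Within a single JVM, the usual partial mitigation for a runaway task is a time limit. The following is a minimal sketch, assuming the tasks respond to thread interruption; the class name `BoundedTask` and the `Outcome` enum are hypothetical names introduced for illustration. Note that `Future.cancel(true)` only interrupts the worker thread: a task stuck in a loop that never checks for interruption keeps running, which is why a separate JVM (or machine) is the only strong isolation here.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedTask {
    /** Outcome of one attempt to run a task under a time limit. */
    public enum Outcome { COMPLETED, FAILED, TIMED_OUT }

    /** Runs the task on its own thread and waits at most timeoutMillis for it. */
    public static Outcome runWithTimeout(Callable<?> task, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<?> future = executor.submit(task);
            try {
                future.get(timeoutMillis, TimeUnit.MILLISECONDS);
                return Outcome.COMPLETED;
            } catch (TimeoutException e) {
                // Requests cancellation by interrupting the worker thread.
                // A task that ignores interruption will keep running.
                future.cancel(true);
                return Outcome.TIMED_OUT;
            } catch (ExecutionException e) {
                // The task itself threw; the caller is unaffected.
                return Outcome.FAILED;
            } catch (InterruptedException e) {
                // The waiting thread was interrupted; restore the flag and give up.
                Thread.currentThread().interrupt();
                future.cancel(true);
                return Outcome.FAILED;
            }
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // A fast task finishes within its limit.
        System.out.println(runWithTimeout(() -> 42, 1000));
        // A slow (but interruptible) task is cut off after 200 ms.
        System.out.println(runWithTimeout(() -> { Thread.sleep(60_000); return null; }, 200));
    }
}
```

The timeout protects the caller from waiting forever, but it does not reclaim CPU or memory from a task that refuses to stop.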

    But then we have the problem that tasks probably need to interact with each other, or with a shared database or some similar resource. Once you include that, you have problems like:

    • a task failing while holding locks,
    • a task failing halfway through updating something,
    • a task failing while some other task is waiting for a message from it,
    • deadlocks and livelocks,
    • networking and hardware failures affecting a subset of your compute nodes.

    There is no magic solution to these problems. Rather, you need to identify the most common failure scenarios, and design your services so that they can (more or less) recover. It is also a good idea to design the system so that, if there is a failure, you don't have to start everything again from the beginning.
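One way to avoid restarting from the beginning is to record progress as you go. The sketch below is only an illustration under stated assumptions: `CheckpointedJob` is a hypothetical class, and the checkpoint is kept in a field, whereas a real service would persist it (to a file or database) so that it survives a process crash.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class CheckpointedJob {
    // Hypothetical in-memory checkpoint: index of the next item to process.
    // A real service would persist this so it survives a restart.
    private int nextIndex = 0;

    /**
     * Processes items starting at the checkpoint. The checkpoint advances
     * only after an item succeeds, so a failure leaves it pointing at the
     * item that failed.
     */
    public void run(List<String> items, Consumer<String> handler) {
        while (nextIndex < items.size()) {
            handler.accept(items.get(nextIndex)); // may throw
            nextIndex++;
        }
    }

    public int nextIndex() {
        return nextIndex;
    }

    public static void main(String[] args) {
        List<String> items = List.of("a", "b", "c");
        List<String> processed = new ArrayList<>();
        boolean[] failOnce = { true };
        Consumer<String> handler = item -> {
            if (item.equals("b") && failOnce[0]) {
                failOnce[0] = false;
                throw new IllegalStateException("simulated failure on " + item);
            }
            processed.add(item);
        };

        CheckpointedJob job = new CheckpointedJob();
        try {
            job.run(items, handler);
        } catch (IllegalStateException e) {
            System.out.println("failed at index " + job.nextIndex());
        }
        job.run(items, handler); // resumes at the failed item, not at "a"
        System.out.println(processed);
    }
}
```

Because the failed item is retried on resume, the handler should be safe to run more than once for the same item.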


    Re your 3 proposed strategies: any of those might be appropriate, depending on the nature of the tasks and other application requirements.
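As an illustration of the first strategy (separate threads), here is a rough sketch in which each service runs on its own thread and an exception in one is contained and reported rather than propagated. `IsolatedServices` and `runAll` are hypothetical names, not part of any library. As the question itself notes, this does not mitigate memory issues: all threads still share one heap.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class IsolatedServices {
    /**
     * Runs each service on its own thread and returns a map of
     * service name -> error description (null if the service succeeded).
     */
    public static Map<String, String> runAll(Map<String, Runnable> services) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, services.size()));
        try {
            Map<String, Future<?>> futures = new LinkedHashMap<>();
            for (Map.Entry<String, Runnable> entry : services.entrySet()) {
                futures.put(entry.getKey(), pool.submit(entry.getValue()));
            }
            Map<String, String> errors = new LinkedHashMap<>();
            for (Map.Entry<String, Future<?>> entry : futures.entrySet()) {
                try {
                    entry.getValue().get();
                    errors.put(entry.getKey(), null);
                } catch (ExecutionException e) {
                    // The service threw; record it and carry on with the others.
                    errors.put(entry.getKey(), String.valueOf(e.getCause()));
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    errors.put(entry.getKey(), "interrupted while waiting");
                }
            }
            return errors;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        Map<String, Runnable> services = new LinkedHashMap<>();
        services.put("bad", () -> { throw new IllegalStateException("boom"); });
        services.put("ok", () -> System.out.println("ok service finished"));
        System.out.println(runAll(services));
    }
}
```

This contains ordinary exceptions only; it does not protect against a service that exhausts memory, hangs, or holds a shared lock.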