
Why does starting a Supervisor in a GenServer cause problems with process restart behavior?


As the title of the question states:

Why does starting a Supervisor in a GenServer cause problems with process restart behavior?

I found a discussion here that states the following:

Specifically:

  • "There is less guarantee provided by the supervision tree as the process might exit and the supervisor will not have terminated its children."

  • "This can lead to problems if the supervisor's children are named because the named children might still exist when a restart occurs higher up the tree (above the process calling start_link in its init/1)"

  • "you lose some advanced OTP features, like code reloading, as process modules are discovered by walking the supervision tree"

What's the underlying reason? Does this hold true in general?

References

  1. Related code changes
  2. Github issue

Solution

  • OTP (the supervision tree) is built on top of BEAM features, such as monitoring, linking, signals and trapping them.

    Supervisors are gen_servers themselves, with the sole purpose of monitoring and restarting their children, and of terminating them (and dying themselves) in a standard way. If you create a gen_server that spawns a supervisor, it means that you need something at that level that a regular supervisor could not provide.

    Let's consider this OTP scenario:

             P1 - Parent supervisor
             |
             G1 - GenServer
             |
             S1 - Child supervisor
             |
             C1 - Child worker
    

    Supervisors wait for all their children to have exited before terminating themselves. A gen_server acting as a supervisor (G1) gives no such guarantee: if it dies for some reason before all its children (S1) have terminated, the parent (P1) may restart it right away as G1'. G1' will spawn S1', which in turn will spawn C1'.

    Now there can be several instances of S1 and C1 running at the same time, and this may very well be a problem. For example, if S1 or C1 is registered under a name, the new instance cannot register that name while the old one is still alive.

    Regarding the code reloading issue mentioned: it does not mean that the new code won't be loaded. It means that the walk down the supervision tree that triggers the code_change callbacks stops at G1, because G1 is not a supervisor and does not propagate it to S1.

    TL;DR: Supervisors are very specialized gen_servers. If you put a regular gen_server in the middle of a supervision tree without providing all the guarantees that a supervisor provides, you lose some of the OTP features in that subtree.
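
To make the "tree walk stops at G1" point concrete, here is a minimal sketch that can be run as a script. The module names (Demo.Worker, Demo.S1, Demo.G1) are hypothetical and only mirror the diagram above; they are not from the linked discussion. It builds the P1 -> G1 -> S1 -> C1 layout, asks P1 what it supervises, and then compares that with a tree where S1 sits directly under a supervisor.

```elixir
defmodule Demo.Worker do
  # C1 in the diagram: a trivial worker (hypothetical module).
  use GenServer

  def start_link(_arg), do: GenServer.start_link(__MODULE__, :ok)

  @impl true
  def init(:ok), do: {:ok, %{}}
end

defmodule Demo.S1 do
  # S1 in the diagram: a regular supervisor with one worker child.
  use Supervisor

  def start_link(_arg), do: Supervisor.start_link(__MODULE__, :ok)

  @impl true
  def init(:ok), do: Supervisor.init([Demo.Worker], strategy: :one_for_one)
end

defmodule Demo.G1 do
  # G1 in the diagram: a GenServer that starts (and links to) S1 from init/1.
  use GenServer

  def start_link(_arg), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  # Returns the pid of the supervisor hidden behind G1.
  def child_sup, do: GenServer.call(__MODULE__, :child_sup)

  @impl true
  def init(:ok) do
    {:ok, s1} = Demo.S1.start_link([])
    {:ok, s1}
  end

  @impl true
  def handle_call(:child_sup, _from, s1), do: {:reply, s1, s1}
end

# Layout from the question: P1 -> G1 -> S1 -> C1
{:ok, p1} = Supervisor.start_link([Demo.G1], strategy: :one_for_one)

# P1 reports a single child of type :worker; S1 and C1 do not appear.
nested = Supervisor.which_children(p1)
IO.inspect(nested, label: "P1 sees")

# S1 is alive and has its own child, but it is only reachable through G1.
hidden = Supervisor.which_children(Demo.G1.child_sup())
IO.inspect(hidden, label: "S1 sees")

# For comparison: S1 directly under a supervisor is reported as :supervisor,
# so anything walking the tree can keep descending into it.
{:ok, p1_flat} = Supervisor.start_link([Demo.S1], strategy: :one_for_one)
flat = Supervisor.which_children(p1_flat)
IO.inspect(flat, label: "flat P1 sees")
```

In the nested layout, P1 only knows G1 as an opaque worker, so anything that discovers processes by asking each supervisor for its children never reaches S1 or C1. The sketch does not try to reproduce the restart race itself, since whether the old S1 is still alive when G1' starts depends on timing.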