
Erlang: Cannot start supervisor on another node


I have a simple supervisor that looks like this:

-module(a_sup).
-behaviour(supervisor).

%% API
-export([start_link/0, init/1]).

start_link() ->
  supervisor:start_link({local,?MODULE}, ?MODULE, []).

init(_Args) ->
  RestartStrategy = {simple_one_for_one, 5, 3600},
  ChildSpec = {
    a_gen_server,
    {a_gen_server, start_link, []},
    permanent,
    brutal_kill,
    worker,
    [a_gen_server]
  },
  {ok, {RestartStrategy,[ChildSpec]}}.

When I run this in the shell, it works perfectly fine. Now I want to run separate instances of this supervisor on two other nodes, foo and bar (started with erl -sname foo and erl -sname bar), and control them from a third node, main (started with erl -sname main). From main I start the supervisor with rpc:call('foo@My-MacBook-Pro', a_sup, start_link, []). The call returns {ok, Pid}, but the supervisor then terminates immediately with this message:

{ok,<9098.117.0>}
=ERROR REPORT==== 7-Mar-2022::16:05:45.416820 ===
** Generic server a_sup terminating 
** Last message in was {'EXIT',<9098.116.0>,
                               {#Ref<0.3172713737.1597505552.87599>,return,
                                {ok,<9098.117.0>}}}
** When Server state == {state,
                            {local,a_sup},
                            simple_one_for_one,
                            {[a_gen_server],
                             #{a_gen_server =>
                                   {child,undefined,a_gen_server,
                                       {a_gen_server,start_link,[]},
                                       permanent,false,brutal_kill,worker,
                                       [a_gen_server]}}},
                            {maps,#{}},
                            5,3600,[],0,never,a_sup,[]}
** Reason for termination ==
** {#Ref<0.3172713737.1597505552.87599>,return,{ok,<9098.117.0>}}

(main@Prachis-MacBook-Pro)2> =CRASH REPORT==== 7-Mar-2022::16:05:45.416861 ===
  crasher:
    initial call: supervisor:a_sup/1
    pid: <9098.117.0>
    registered_name: a_sup
    exception exit: {#Ref<0.3172713737.1597505552.87599>,return,
                     {ok,<9098.117.0>}}
      in function  gen_server:decode_msg/9 (gen_server.erl, line 481)
    ancestors: [<9098.116.0>]
    message_queue_len: 0
    messages: []
    links: []
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 610
    stack_size: 29
    reductions: 425
  neighbours:

From the message it looks like the call expects the supervisor to be a gen_server instead? When I start a gen_server on the node in the same way, it works fine, but a supervisor does not. Is there something different about starting a supervisor on a local versus a remote node, and if so, how do I fix it?


Solution

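  • As @JoséM suggested, the problem is that rpc:call runs a_sup:start_link() in an ephemeral process on the remote node, and start_link links the new supervisor to that process. When the RPC process exits, the supervisor receives the 'EXIT' signal from its parent (the "Last message in" entry of the error report) and terminates. The supervisor module only provides start_link, with no non-linking start function, so one fix is to unlink from the caller inside start_link():

    start_link() ->
      {ok, Pid} = supervisor:start_link({local,?MODULE}, ?MODULE, []),
      unlink(Pid),
      {ok, Pid}.

    This solves the issue.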
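    To check that the supervisor outlives the RPC call, a session on the main node might look like the sketch below. It assumes the node name from the question, that a_sup with the unlinking start_link() is loaded on foo, and that a_gen_server:start_link/0 exists as the child spec implies. The variable names are illustrative only.

```erlang
%% Run these expressions in the shell of the main node.
%% Node name taken from the question; adjust to your own host.
Node = 'foo@My-MacBook-Pro'.

%% Start the supervisor on the remote node.
{ok, SupPid} = rpc:call(Node, a_sup, start_link, []).

%% The supervisor should still be registered on the remote node
%% after the ephemeral RPC process has exited.
SupPid = rpc:call(Node, erlang, whereis, [a_sup]).

%% Address the locally registered supervisor as {Name, Node}.
%% For simple_one_for_one, the second argument is the list of extra
%% arguments appended to the child's start arguments (none here).
{ok, ChildPid} = supervisor:start_child({a_sup, Node}, []).
```

    Note that after unlink/1 the supervisor is no longer linked to any parent, so nothing on the remote node will restart or shut it down if it crashes or its starter goes away.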