Search code examples
moduleerlangerlang-supervisor

Why does my supervisor fail on start_child with undef?


I am trying to run a simple_one_for_one supervisor where the supervisor and worker are placed in separate modules and i keep getting the below error when using supervisor:start_child:

>A=sup:start_link().
>B=supervisor:start_child(A,[]).
{error,{'EXIT',{undef,[{worker,start_link,[],[]},
                       {supervisor,do_start_child_i,3,
                                   [{file,"supervisor.erl"},{line,379}]},
                       {supervisor,handle_call,3,
                                   [{file,"supervisor.erl"},{line,404}]},
                       {gen_server,try_handle_call,4,
                                   [{file,"gen_server.erl"},{line,661}]},
                       {gen_server,handle_msg,6,
                                   [{file,"gen_server.erl"},{line,690}]},
                       {proc_lib,init_p_do_apply,3,
                                 [{file,"proc_lib.erl"},{line,249}]}]}}}

Supervisor

-module(sup).
-behaviour(supervisor).
-compile([export_all]).

start_link()->
    {ok,Pid}=supervisor:start_link(?MODULE,[]),
    io:format("sugi pl"),
    Pid.



init(_Args) ->
     RestartStrategy = {simple_one_for_one, 10, 60},
     ChildSpec = {
                  worker, 
                  {worker, start_link, []},  //tried adding here a parameter in the A
                  permanent,
                  brutal_kill, 
                  worker,
                  [sup]
                },
    {ok, {RestartStrategy,[ChildSpec]}}.

Worker

-module(worker).
-compile(export_all).


start_link([Arg])->  //tried both [Arg] and Arg
    {ok,Pid}=spawn_link(?MODULE,init,[]),
    Pid.


init([Time])->
    receive->
        {From,Msg}->From !{Time,Msg},
                     init(Time)
    end.

Command

>c("[somepath]/sup.erl"),A=sup:start_link(),B=supervisor:start_child(A,[]).

I can see clearly that the problem is when trying to add a child.Somehow the init function is not called properly but i do not see why.(Badmatch)
I have tried adding a parameter in the A of the MFA of the ChildSpec to no avail


Solution

  • There are a number of problems with your code.

    • The return value of sup:start_link/0 is wrong; you're returning Pid instead of {ok, Pid}.
    • While it's not really incorrect, you're using supervisor:start_link/2, which doesn't register the supervisor name. Having the name is handy, so it's better to use supervisor:start_link/3:

      supervisor:start_link({local, ?MODULE}, ?MODULE, []),
      

      This associates the module name with its process id, allowing you to use the process name in your shell commands rather than using pid variables.

    • You have an io:format/2 call in sup:start_link/0, presumably for debugging. A much better way of debugging is to call sys:trace(sup, true) from your shell once you've started the sup supervisor. You can turn it off from the shell as well, by specifying false instead of true as the second argument.

    Fixing the above issues leaves use with the following definition of sup:start_link/0:

    start_link() ->
        supervisor:start_link({local, ?MODULE}, ?MODULE, []).
    

    Let's recompile, start the supervisor, then compile worker (after fixing its syntax errors), and then trace the supervisor as we try to start a child:

    1> c(sup).
    sup.erl:3: Warning: export_all flag enabled - all functions will be exported
    {ok,sup}
    2> sup:start_link().
    {ok,<0.94.0>}
    3> sys:trace(sup, true).
    ok
    4> c(worker).
    worker.erl:2: Warning: export_all flag enabled - all functions will be exported
    worker.erl:5: Warning: variable 'Arg' is unused
    {ok,worker}
    5> supervisor:start_child(sup, []).
    *DBG* sup got call {start_child,[]} from <0.81.0>
    *DBG* sup sent {error,
                       {'EXIT',
                           {undef,
                               [{worker,start_link,[],[]},
                                {supervisor,do_start_child_i,3,
                                    [{file,"supervisor.erl"},{line,379}]},
    ...
    

    This abbreviated trace output shows that sup died when it tried to start a worker by calling worker:start_link/0 (the [] indicates there are zero arguments). The child specification tells sup to start it this way, since it contains

    {worker, start_link, []}
    

    and we started the child via supervisor:start_child(sup, []). For a simple_one_for_one child, the arguments sent to its start function are composed of the argument list from the child specification combined with the arguments specified in the call to supervisor:start_child/2; in this case, that's the equivalent of [] ++ [] which is the same as [], indicating no arguments. Let's change the worker:start_link/1 function to worker:start_link/0 instead, recompile it, and try again:

    6> c(worker).
    worker.erl:2: Warning: export_all flag enabled - all functions will be exported
    {ok,worker}
    7> supervisor:start_child(sup, []).
    *DBG* sup got call {start_child,[]} from <0.81.0>
    *DBG* sup sent {error,
                       {'EXIT',
                           {{badmatch,<0.94.0>},
                            [{worker,start_link,0,[{file,"worker.erl"},{line,6}]},
    ...
    

    This time, the abbreviated output shows a badmatch. This is because spawn_link/3 returns a pid, but worker:start_link/0 is expecting it to return {ok, Pid}. Let's fix that, and also fix the return value to be {ok, Pid} instead of just Pid:

    start_link()->
        Pid = spawn_link(?MODULE,init,[]),
        {ok, Pid}.
    

    Then let's recompile and try again:

    8> c(worker).
    worker.erl:2: Warning: export_all flag enabled - all functions will be exported
    {ok,worker}
    9> supervisor:start_child(sup, []).
    *DBG* sup got call {start_child,[]} from <0.81.0>
    *DBG* sup sent {ok,<0.106.0>} to <0.81.0>, new state {state,
                                                          {local,sup},
                                                          simple_one_for_one,
                                                          {[worker],
                                                           #{worker =>
                                                              {child,undefined,
                                                               worker,
                                                               {worker,
                                                                start_link,[]},
                                                               permanent,
                                                               brutal_kill,worker,
                                                               [sup]}}},
                                                          {maps,
                                                           #{<0.106.0> => []}},
                                                          10,60,[],0,sup,[]}
    *DBG* sup got {'EXIT',<0.106.0>,{undef,[{worker,init,[],[]}]}}
    

    OK, this time the supervisor actually started the child, but it died immediately because it's trying to call worker:init/0 but only worker:init/1 is defined. Because the child dies immediately, the supervisor tries repeatedly to start it, based on its restart strategy:

    RestartStrategy = {simple_one_for_one, 10, 60},
    

    Because this is a hard error, it fails instantly every time, and the supervisor dies after 10 failed restarts in 60 seconds or less, just like it's supposed to:

    =SUPERVISOR REPORT==== 20-Apr-2020::10:43:43.557307 ===
        supervisor: {local,sup}
        errorContext: shutdown
        reason: reached_max_restart_intensity
        offender: [{pid,<0.117.0>},
                   {id,worker},
                   {mfargs,{worker,start_link,[]}},
                   {restart_type,permanent},
                   {shutdown,brutal_kill},
                   {child_type,worker}]
    ** exception error: shutdown
    

    It appears from your code that you're attempting to pass some sort of Time argument to worker:init/1, so let's change start_link/0 to pass a timestamp:

    start_link()->
        Pid = spawn_link(?MODULE,init,[os:timestamp()]),
        {ok, Pid}.
    

    Let's also fix init/1 to take the argument directly, rather than in a list:

    init(Time) ->
        receive
            {From,Msg} ->
                From ! {Time,Msg},
                init(Time)
    end.
    

    Let's restart the supervisor, recompile worker, and try again:

    10> sup:start_link().
    {ok,<0.119.0>}
    11> sys:trace(sup, true).
    ok
    12> c(worker).
    worker.erl:2: Warning: export_all flag enabled - all functions will be exported
    {ok,worker}
    13> {ok, Child} = supervisor:start_child(sup, []).
    *DBG* sup got call {start_child,[]} from <0.118.0>
    *DBG* sup sent {ok,<0.127.0>} to <0.118.0>, new state {state,
                                                           {local,sup},
                                                           simple_one_for_one,
                                                           {[worker],
                                                            #{worker =>
                                                               {child,undefined,
                                                                worker,
                                                                {worker,
                                                                 start_link,[]},
                                                                permanent,
                                                                brutal_kill,
                                                                worker,
                                                                [sup]}}},
                                                           {maps,
                                                            #{<0.127.0> => []}},
                                                           10,60,[],0,sup,[]}
    {ok,<0.127.0>}
    

    It looks like it worked. Let's see if the supervisor agrees, by asking it how many children it has:

    14> supervisor:count_children(sup).
    ...
    [{specs,1},{active,1},{supervisors,0},{workers,1}]
    

    It has one worker, just as we expected. Finally, let's send a message to the worker and see if it responds as expected:

    15> Child ! {self(), "are you there?"}.
    {<0.118.0>,"are you there?"}
    16> flush().
    Shell got {{1587,394860,258120},"are you there?"}
    ok
    

    Now it all seems to work.

    One final fix is to change the modules in your child specification; rather than [sup] it should be the module itself, [worker]. With that change, your revised working modules are below. You might also want to reconsider whether you want to use permanent for the children, since this is a simple_one_for_one supervisor; transient is likely a better choice, but I've left it as originally written. Consider reviewing the Supervisor Behavior documentation for more information.

    Supervisor

    -module(sup).
    -behaviour(supervisor).
    -compile([export_all]).
    
    start_link()->
        supervisor:start_link({local, ?MODULE}, ?MODULE, []).
    
    init(_Args) ->
         RestartStrategy = {simple_one_for_one, 10, 60},
         ChildSpec = {
                      worker,
                      {worker, start_link, []},
                      permanent,
                      brutal_kill,
                      worker,
                      [worker]
                    },
        {ok, {RestartStrategy,[ChildSpec]}}.
    

    Worker

    -module(worker).
    -compile(export_all).
    
    start_link()->
        Pid = spawn_link(?MODULE,init,[os:timestamp()]),
        {ok, Pid}.
    
    init(Time) ->
        receive
            {From,Msg} ->
                From ! {Time,Msg},
                init(Time)
        end.