Tags: erlang, erlang-otp, erlang-supervisor, gen-server

Erlang: how to deal with long running init callback?


I have a gen_server that, when started, attempts to start a certain number of child processes (usually 10-20) under a supervisor in the supervision tree. The gen_server's init callback invokes supervisor:start_child/2 for each child process needed. The call to supervisor:start_child/2 is synchronous, so it doesn't return until the child process has started. All the child processes are also gen_servers, so the start_link call doesn't return until the child's init callback returns. In that init callback a call is made to a third-party system, which may take a while to respond (I discovered this issue when calls to the third-party system were timing out after 60 seconds). In the meantime the init call is blocked, which means supervisor:start_child/2 is blocked too. So the whole time, the gen_server process that invoked supervisor:start_child/2 is unresponsive, and calls to it time out while it waits for start_child to return. This can easily last for 60 seconds or more. I would like to change this, as my application is stuck in a sort of half-started state while it waits.

What is the best way to resolve this issue?

The only solution I can think of is to move the code that interacts with the third-party system out of the init callback and into a handle_cast callback. This would make the init callback faster. The disadvantage is that I would need to call gen_server:cast/2 after all the child processes have been started.

Is there a better way of doing this?


Solution

  • One approach I've seen is to return a timeout from init/1 and finish the work in handle_info/2.

    init(Args) ->
      {ok, {timeout_init, Args} = _State, 0 = _Timeout}.
    
    
    ...
    
    
    handle_info( timeout, {timeout_init, Args}) ->
       %% do your initialization
       {noreply, ActualServerState};  % this time no need for timeout 
    
    handle_info( .... 
    

    Almost every callback result can be returned with an additional timeout parameter, which is the time to wait for another message. If that time passes without a message arriving, handle_info/2 is called with the atom timeout and the server's state. In our case, with the timeout equal to 0, the timeout fires as soon as the server enters its receive loop, provided no message is already waiting in its mailbox. When the server is reached only through its pid, nobody can send to it before the start function returns that pid, so timeout_init will normally be the first thing the server handles. This gives us some assurance that we finish initialization before handling anything else.
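
    To make the pattern concrete, here is a minimal, self-contained sketch. The module name timeout_init_demo, the state/1 accessor, the {ready, Args} state and the timer:sleep/1 stand-in for the slow third-party call are all assumptions made for illustration.

    ```erlang
    -module(timeout_init_demo).
    -behaviour(gen_server).

    -export([start_link/1, state/1]).
    -export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

    start_link(Args) ->
        gen_server:start_link(?MODULE, Args, []).

    %% Hypothetical accessor, used only to observe the server state.
    state(Pid) ->
        gen_server:call(Pid, state).

    init(Args) ->
        %% Return at once; the 0 timeout triggers handle_info(timeout, ...)
        %% when the server enters its loop with an empty mailbox.
        {ok, {timeout_init, Args}, 0}.

    handle_call(state, _From, State) ->
        {reply, State, State}.

    handle_cast(_Msg, State) ->
        {noreply, State}.

    handle_info(timeout, {timeout_init, Args}) ->
        %% Stand-in for the slow third-party call (assumption).
        timer:sleep(100),
        {noreply, {ready, Args}};
    handle_info(_Other, State) ->
        {noreply, State}.
    ```

    Note that the timeout is cancelled by any message that arrives first, so a request that reaches the server before the timeout fires would see the {timeout_init, Args} state.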

    If you don't like this approach (it is not very readable), you can instead send a message to self() in init/1:

    init(Args) ->
       self() ! {finish_init, Args},
       {ok, no_state_yet}.
    
    ...
    
    
    handle_info({finish_init, Args} = _Message, no_state_yet) ->
       %% finish initialization here
       {noreply, ActualServerState};
    
    handle_info(  ... % other clauses 
    

    Again, this makes sure the message that finishes initialization reaches the server as early as possible, which is very important for gen_servers that register under an atom.
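
    A complete version of this variant might look like the sketch below. The module name self_msg_init_demo, the state/1 accessor, the {ready, Args} state and the timer:sleep/1 placeholder are assumptions for illustration only.

    ```erlang
    -module(self_msg_init_demo).
    -behaviour(gen_server).

    -export([start_link/1, state/1]).
    -export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

    start_link(Args) ->
        gen_server:start_link(?MODULE, Args, []).

    %% Hypothetical accessor, used only to observe the server state.
    state(Pid) ->
        gen_server:call(Pid, state).

    init(Args) ->
        %% Queue the real initialization as the first message in our mailbox.
        self() ! {finish_init, Args},
        {ok, no_state_yet}.

    handle_call(state, _From, State) ->
        {reply, State, State}.

    handle_cast(_Msg, State) ->
        {noreply, State}.

    handle_info({finish_init, Args}, no_state_yet) ->
        %% Stand-in for the slow third-party call (assumption).
        timer:sleep(100),
        {noreply, {ready, Args}};
    handle_info(_Other, State) ->
        {noreply, State}.
    ```

    Because this server is not registered, a caller can only reach it through the pid returned by start_link/1, so a state/1 call made right after start-up is queued behind finish_init and sees the initialized state.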


    EDIT: after a more careful study of the OTP source code.

    This approach is good enough when you communicate with your server through its pid, mainly because the pid is only returned after your init/1 function returns. It is a little different for a gen_server started with start/4 or start_link/4, where the process is automatically registered under a name. There is one race condition you could encounter, which I would like to explain in a little more detail.

    If the process is registered, one usually simplifies all calls and casts to the server, like:

    count() ->
       gen_server:cast(?SERVER, count).
    

    Here ?SERVER is usually the module name (an atom), and this works fine as long as some registered (and alive) process exists under that name. Under the hood, this cast is a standard Erlang message send with !. There is nothing magical about it; it is almost the same as what you do in your init with self() ! {finish ....

    But in our case we assume one more thing: not just that the process is registered, but also that our server has finished its initialization. Since we are dealing with a mailbox, it does not really matter how long something takes, but it does matter which message we receive first. To be exact, we would like to receive the finish_init message before receiving the count message.

    Unfortunately, the opposite order can happen. This is because OTP gen_* processes are registered before the init/1 callback is called. So in theory, while one process calls the start function and gets as far as the registration step, another process could look up our server and send it a count message, and only after that would init/1 be called and the finish_init message sent. The chances are small (very, very small), but it could still happen.
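
    The registration order can be observed directly. In this minimal sketch (the module name reg_order_demo and the registered_in_init/0 helper are hypothetical), init/1 records whether the name already points at the new process when the callback runs.

    ```erlang
    -module(reg_order_demo).
    -behaviour(gen_server).

    -export([start_link/0, registered_in_init/0]).
    -export([init/1, handle_call/3, handle_cast/2]).

    -define(SERVER, ?MODULE).

    start_link() ->
        %% Ask OTP to register the process under ?SERVER automatically.
        gen_server:start_link({local, ?SERVER}, ?MODULE, [], []).

    %% Hypothetical helper: reports what init/1 observed.
    registered_in_init() ->
        gen_server:call(?SERVER, registered_in_init).

    init([]) ->
        %% whereis/1 returns the pid registered under the name, or undefined.
        %% The state is simply whether that pid is already us.
        {ok, whereis(?SERVER) =:= self()}.

    handle_call(registered_in_init, _From, State) ->
        {reply, State, State}.

    handle_cast(_Msg, State) ->
        {noreply, State}.
    ```

    If the name is already visible inside init/1, other processes can find it too, which is exactly the window in which a count message could slip in ahead of finish_init.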

    There are three solutions to this.

    The first is to do nothing. In case of such a race condition, handle_cast would fail (with a function clause error, since our state is the no_state_yet atom), and the supervisor would just restart the whole thing.

    The second is to ignore this bad message/state incident. This is easily achieved with

       ... ;
    handle_cast( _, State) -> 
       {noreply, State}.
    

    as your last clause. Unfortunately, most people using templates end up with this (IMHO unfortunate) pattern.

    In both of these cases you could lose one count message. If that is really a problem, you could still try to fix it by changing the last clause to

       ... ;
    handle_cast(Message, no_state_yet) -> 
       gen_server:cast( ?SERVER, Message),
       {noreply, no_state_yet}.
    

    but this has other obvious disadvantages (for one, the re-sent message goes to the back of the mailbox, so message order changes), and I would prefer the "let it fail" approach.

    The third option is to register the process a little later. Rather than using start/4 and asking for automatic registration, use start/3, receive the pid, and register it yourself.

    start(Args) ->
       {ok, Pid} = gen_server:start(?MODULE, Args, []),
       register(?SERVER, Pid),
       {ok, Pid}.
    

    This way the finish_init message is sent before registration, and before anyone else could send a count message.

    But this approach has its own drawbacks, mainly the registration itself, which could fail in a few different ways. One could always check how OTP handles that and duplicate that code. But this is another story.
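
    As a rough illustration of one such failure, the sketch below handles the case where the name is already taken. It is not how OTP does it internally. The module name late_register_demo, the {ready, Args} state and the choice to kill the freshly started server are assumptions; the {error, {already_started, Pid}} result only mimics what gen_server:start/4 returns for a name that is in use.

    ```erlang
    -module(late_register_demo).
    -behaviour(gen_server).

    -export([start/1]).
    -export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

    -define(SERVER, ?MODULE).

    start(Args) ->
        {ok, Pid} = gen_server:start(?MODULE, Args, []),
        %% register/2 raises badarg if the name is taken or Pid is dead.
        try register(?SERVER, Pid) of
            true -> {ok, Pid}
        catch
            error:badarg ->
                %% Assumption: discard the server we just started.
                exit(Pid, kill),
                case whereis(?SERVER) of
                    undefined -> {error, registration_failed};
                    Other -> {error, {already_started, Other}}
                end
        end.

    init(Args) ->
        %% finish_init is queued before anyone can look the server up by name.
        self() ! {finish_init, Args},
        {ok, no_state_yet}.

    handle_call(_Request, _From, State) ->
        {reply, ok, State}.

    handle_cast(_Msg, State) ->
        {noreply, State}.

    handle_info({finish_init, Args}, no_state_yet) ->
        {noreply, {ready, Args}};
    handle_info(_Other, State) ->
        {noreply, State}.
    ```

    Even this version leaves gaps: the process holding the name could exit between register/2 failing and whereis/1 being called, which is why the result falls back to {error, registration_failed}.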

    So in the end it all depends on what you need, or even on what problems you encounter in production. It is important to have some idea of what could go wrong, but I personally wouldn't try to fix any of it until I actually suffered from such a race condition.