
Gossip between GenServers ends before every process reaches its exit condition


I am creating multiple GenServers that gossip by sending messages to each other. I have set an exit condition so that every process dies once it has received 10 messages. Each GenServer is created at the start of the gossip, in the launch function.

defmodule Gossip do
    use GenServer

    # starting gossip
    def start_link(watcher \\ nil), do: GenServer.start_link(__MODULE__, watcher)
    def init(watcher), do: {:ok, {[],0,watcher}}
    def launch(n, watcher \\ nil) do
        crew = (for _ <- 0..n, do: elem(Gossip.start_link(watcher),1))
        Enum.map(crew, &(add_crew(&1,crew--[&1])))
        crew
            |> hd()
            |> Gossip.send_msg()
    end 


    # client side
    def add_crew(pid, crew), do: GenServer.cast(pid, {:add_crew, crew})
    def rcv_msg(pid, msg \\ ""), do: GenServer.cast(pid, {:rcv_msg, msg})
    def send_msg(pid, msg \\ ""), do: GenServer.cast(pid, {:send_msg, msg})


    # server side  
    def handle_cast({:add_crew, crew}, {_, msg_counter, watcher}), do:
        {:noreply, {crew, msg_counter, watcher}}

    def handle_cast({:rcv_msg, _msg}, {crew, msg_counter, watcher}) do
        if msg_counter < 10 do
            send_msg(self())
        else
            GossipWatcher.increase(watcher)
            IO.inspect(self(), label: "exit of:") |> Process.exit(:normal)
        end
        {:noreply, {crew, msg_counter+1, watcher}}
    end

    def handle_cast({:send_msg,_},{[],_,_}), do: Process.exit(self(),"crew empty")
    def handle_cast({:send_msg, _msg}, {crew, msg_counter, watcher}=state) do
        rcpt = Enum.random(crew) ## recipient of the msg
        if Process.alive?(rcpt) do
            IO.inspect({self(),rcpt}, label: "send message from/to")
            rcv_msg(rcpt, "ChitChat")
            send_msg(self())
            {:noreply, state}
        else
            IO.inspect(rcpt, label: "recipient is dead:")
            {:noreply, {crew -- [rcpt], msg_counter, watcher}}
        end
    end
end


defmodule GossipWatcher do
    use GenServer

    def start_link(opt \\ []), do: GenServer.start_link(__MODULE__, opt)
    def init(opt), do: {:ok, {0}}
    def increase(pid), do: GenServer.cast(pid, {:increase})  
    def handle_cast({:increase}, {counter}), do:
        IO.inspect({:noreply, {counter+1}}, label: "toll of dead")

end

I use the GossipWatcher module to count the GenServers that die after receiving 10 messages. The issue is that the iex prompt comes back while some GenServers are still alive. For example, out of 1000 GenServers, only ~964 die by the end of the gossip.

iex(15)> {:ok, watcher} = GossipWatcher.start_link
{:ok, #PID<0.11163.0>}
iex(16)> Gossip.launch 100, watcher            
send message from/to: {#PID<0.11165.0>, #PID<0.11246.0>}
:ok     
send message from/to: {#PID<0.11165.0>, #PID<0.11167.0>}
send message from/to: {#PID<0.11246.0>, #PID<0.11182.0>}
send message from/to: {#PID<0.11165.0>, #PID<0.11217.0>}
...
toll of dead: {:noreply, {960}}
toll of dead: {:noreply, {961}}
toll of dead: {:noreply, {962}}
toll of dead: {:noreply, {963}}
toll of dead: {:noreply, {964}}
iex(17)>

Am I missing something here? Is the process timing out? Any help would be appreciated.
TIA.


Solution

  • The part of your code that can cause trouble is here:

    def handle_cast({:send_msg, _msg}, {crew, msg_counter, watcher}=state) do
    
        ...
    
        if Process.alive?(rcpt) do
    
        ...
    
        else
            IO.inspect(rcpt, label: "recipient is dead:")
            {:noreply, {crew -- [rcpt], msg_counter, watcher}}
        end
    end
    

    In this else branch, you allow the GenServer to go idle: since it sends a message neither to a neighbor nor to itself, no further action is triggered and it simply stops doing anything.
    In the worst (and unlikely) case: if you start 2000 GenServers and launch the gossip from one of them, and this first one only talks to a second one, which in turn only talks to the first one, then only one GenServer is going to die, and you get the command prompt back with 1999 GenServers still alive but doing nothing (since they are receiving no messages).

    Even if this case is far-fetched, it shows that the gossip can end prematurely, before every GenServer has received 10 messages. Hence the behavior you describe.
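
One possible way to avoid this stall, as a minimal sketch rather than a definitive fix: keep the send loop going in the else branch too, and stop cleanly once the crew is empty. The module name GossipSketch, the @limit attribute, the unlinked GenServer.start/2 and the {:stop, :normal, state} return values are assumptions made for illustration and are not part of the original code; the watcher is left out to keep the example short.

```elixir
defmodule GossipSketch do
  use GenServer

  # Number of messages a process must receive before it stops.
  @limit 10

  # Unlinked start (assumption): a stopping peer cannot affect the caller.
  def start, do: GenServer.start(__MODULE__, :ok)

  # Starts n processes, tells each one about the others, kicks off the gossip.
  def launch(n) do
    crew = for _ <- 1..n, do: elem(start(), 1)
    Enum.each(crew, fn pid -> add_crew(pid, crew -- [pid]) end)
    send_msg(hd(crew))
    crew
  end

  # client side
  def add_crew(pid, crew), do: GenServer.cast(pid, {:add_crew, crew})
  def rcv_msg(pid), do: GenServer.cast(pid, :rcv_msg)
  def send_msg(pid), do: GenServer.cast(pid, :send_msg)

  # server side: state is {crew, received_count}
  @impl true
  def init(:ok), do: {:ok, {[], 0}}

  @impl true
  def handle_cast({:add_crew, crew}, {_, count}), do: {:noreply, {crew, count}}

  # The message that brings the count up to @limit stops the process.
  def handle_cast(:rcv_msg, {crew, count}) when count + 1 >= @limit do
    {:stop, :normal, {crew, count + 1}}
  end

  def handle_cast(:rcv_msg, {crew, count}) do
    send_msg(self())
    {:noreply, {crew, count + 1}}
  end

  # Nobody left to talk to: stop instead of idling forever.
  def handle_cast(:send_msg, {[], _count} = state), do: {:stop, :normal, state}

  def handle_cast(:send_msg, {crew, count}) do
    rcpt = Enum.random(crew)
    # Re-arm the loop in BOTH branches, so a dead recipient no longer
    # silences this process.
    send_msg(self())

    if Process.alive?(rcpt) do
      rcv_msg(rcpt)
      {:noreply, {crew, count}}
    else
      {:noreply, {crew -- [rcpt], count}}
    end
  end
end
```

Because delivery is random and asynchronous, this change does not by itself guarantee that every process receives 10 messages; it only removes the case where a sender goes idle after picking a dead recipient.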


    I did some tests, rewriting your code and using a second type of GenServer to monitor how many GenServers are killed and how many survive. It turns out that, out of 1000 GenServers, an average of 40 are still alive after I get the iex prompt back.
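
The monitoring code used for that test is not shown here. As a rough sketch of one way such a count could be taken (not the code behind the numbers above), the hypothetical helper SurvivorCount.tally/2 below monitors a list of pids with Process.monitor/1 and reports how many went down and how many are still up once no :DOWN message has arrived for a given quiet period. The module name, the function name and the default timeout are all assumptions.

```elixir
defmodule SurvivorCount do
  # Monitors every pid and collects :DOWN messages until `quiet_ms`
  # passes without one. Returns {dead_count, alive_count}.
  def tally(pids, quiet_ms \\ 1_000) do
    refs = Map.new(pids, fn pid -> {Process.monitor(pid), pid} end)
    remaining = collect(refs, quiet_ms)
    # Drop the monitors we no longer need and flush late :DOWN messages.
    Enum.each(Map.keys(remaining), &Process.demonitor(&1, [:flush]))
    {map_size(refs) - map_size(remaining), map_size(remaining)}
  end

  defp collect(refs, quiet_ms) do
    if map_size(refs) == 0 do
      refs
    else
      receive do
        # Only consume :DOWN messages for monitors created by tally/2.
        {:DOWN, ref, :process, _pid, _reason} when is_map_key(refs, ref) ->
          collect(Map.delete(refs, ref), quiet_ms)
      after
        quiet_ms -> refs
      end
    end
  end
end
```

Calling `{dead, alive} = SurvivorCount.tally(crew)` assumes that `crew` is the list of GenServer pids. Gossip.launch as posted returns the result of the final cast rather than that list, so it would need to return `crew` for this to work.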