I'm currently struggling with this bug, because i don't have much logs and infos provided by my server. I have an endpoint called /stop (code sample provided below) in which i execute a lot of code before returning :ok
. All this code is wrap in multiple functions and through multiple files and services.
Not far from the end of those functions, i'm writing to the database. I have 3 different inserts. But since a few days, i have this error : Postgrex.Protocol (#PID<0.375.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.543.0> exited
showing up randomly in one of those 3 inserts. (It could be the first one, it could be in the middle of the second one, or the last one. And sometimes (rarely) it doesn't even fail). The only thing i've been able to isolate as a "trigger" is when the inserted data is a little bigger. It has always worked before, and it also worked nicely when i'm trying to insert those same datas directly via iex.
On the Postgres side, i only have this log : LOG: unexpected EOF on client connection with an open transaction
.
The controller code looks like this :
def stop(conn, params) do
case @influx_caller.end_connexion_performable?(params) do
true ->
handle_end_of_connexion(params)
conn |> send_resp(:ok, [])
false ->
if @influx_caller.connexion_data_from(params["connexion_key"]) == %{} do
update_connexion_with(:influx_empty, params, true)
end
conn |> send_resp(:forbidden, [])
end
end
end
end
The failure happens in the happy path (the true option of the case) in one of the files called by the handle_end_of_connexion
function. For example in this query :
Ecto.Changeset.change(connexion, %{
closed_at: Calculation.get_close_time(events),
})
|> Ecto.Changeset.put_assoc(:infos, MyApp.InfosGenerator.perform(obj, obj_type, twi, twl, events))
|> MyApp.Repo.update
That exact same file with that exact same data can be called from another controller in my app (sort of admin panel), and it works just fine. So i think my problem is coming from the /stop endpoint.
Things i've tested so far :
Checking my OTP version. It was 21.3 and i found a lot of reports talking about similar errors coming from it, so i updated it to 22.1.7
Activating the show_sensitive_data_on_connection_error
hoping it would add some logs to investigate, but no luck.
Updating phoenix, plug_cowboy, postgrex, and ecto to the latest version.
Adding http: [protocol_options: [idle_timeout: :infinity]],
in my Mix.Config, since i've also read about a timeout from cowboy since it's 2.0 version.
None of this works. (Of course, i also tried to log things everywhere, but everything is working just fine in local mode)
Any help or at least lead would be much appreciated. Thanks in advance !
[EDIT] Thanks to observer, i managed to collect those additional information :
When this errors happens : Postgrex.Protocol (#PID<0.422.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.31212.0> exited
, i have those lines popping in observer =>
11:06:52:783159 (<0.31212.0>) out {lists,reverse,1}
11:06:52:783212 (<0.31212.0>) in {lists,reverse,1}
11:06:52:783218 (<0.31212.0>) << {inet_reply,#Port<0.71>,ok}
11:06:52:783234 (<0.31212.0>) out {prim_inet,send_recv_reply,2}
11:06:52:783243 (<0.31212.0>) in {prim_inet,send_recv_reply,2}
11:06:52:783247 (<0.31212.0>) out {prim_inet,recv0,3}
11:06:52:797698 (<0.31209.0>) in {cowboy_http,loop,1}
11:06:52:797725 (<0.31209.0>) << {tcp_closed,#Port<0.140>}
11:06:52:797750 (<0.31209.0>) getting_unlinked #Port<0.140>
11:06:52:797786 (<0.31209.0>) out {cowboy_children,terminate_loop,2}
11:06:52:797794 (<0.31212.0>) in {prim_inet,recv0,3}
11:06:52:797800 (<0.31212.0>) exit shutdown
11:06:52:797820 (<0.31212.0>) out_exited 0
11:06:52:797905 (<0.31209.0>) in {cowboy_children,terminate_loop,2}
11:06:52:797911 (<0.31209.0>) getting_unlinked <0.31212.0>
11:06:52:797915 (<0.31209.0>) << {'EXIT',<0.31212.0>,shutdown}
11:06:52:797943 (<0.31209.0>) exit {shutdown,{socket_error,closed,'The socket has been closed.'}}
11:06:52:797951 (<0.31209.0>) out_exited 0
The process was spawn here :
11:06:52:304313 (<0.31209.0>) spawn <0.31212.0> as proc_lib:init_p(<0.31209.0>,[<0.447.0>,<0.446.0>,'Elixir.Tracking.Endpoint',<0.414.0>,<0.413.0>],cowboy_stream_h,request_process,[#{body_length => 136,cert => undefined,has_body => true, [etc...]
I really don’t understand what and why this process is closed and the link is lost :/
It turns out that since a recent cowboy update, the request timeout is handled differently. The best (and cleanest) workaround i found was to wrap all the actions made after the call to my controller into an all new process.