Tags: python, amazon-web-services, jupyter-notebook, jupyterhub, amazon-eks

JupyterHub kernel connection returns HTTP504 GATEWAY_TIMEOUT


I am deploying JupyterHub 0.8.2 to Kubernetes (EKS on AWS, v1.13).

When I deploy the JupyterHub application to EKS via Helm, everything deploys and starts fine. However, when I spawn a notebook server and create a Python notebook, the kernel hangs while trying to connect (see screenshots at the bottom).

I saw a similar issue posted here: https://github.com/jupyter/notebook/issues/2664; it seems there was a regression in the tornado Python package. However, I tried downgrading to 5.1.1 and that did not fix the issue.

What are the next troubleshooting steps I can try? Where can I find diagnostic info/logs for the Python kernel?
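As a first pass at the log question: in a default zero-to-jupyterhub Helm install, the hub, proxy, and single-user server each run as separate pods, and kernel-level errors surface in the single-user pod's logs. A sketch of where to look, assuming the chart's default resource names and a release installed into a `jhub` namespace (both are assumptions; adjust to your deployment):

```shell
# List the pods in the JupyterHub namespace ("jhub" is an assumed name)
kubectl get pods -n jhub

# Hub logs: spawner and auth errors
kubectl logs -n jhub deploy/hub

# Proxy logs: routing and websocket-forwarding errors
kubectl logs -n jhub deploy/proxy

# A user's single-user notebook server pod: kernel start/connect
# errors land here (pod is named jupyter-<username> by default)
kubectl logs -n jhub jupyter-<username>
```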

[Screenshots: the notebook kernel hangs in the "Connecting" state]

Update: one of our existing clusters, which had been running fine for about two months, started experiencing this kernel issue just today. This makes me wonder if this is some sort of regression; however, how would that affect a JupyterHub deployment that has not been modified? Does JupyterHub update libraries/packages by itself, without consent?

Update 2: I inspected the network traffic in the browser and discovered that the request to https://<<JUPYTERHUB_DOMAIN>>/user/me/api/kernels/<<KERNEL_ID>>/channels?session_id=<<SESSION_ID>> returns HTTP 504 GATEWAY_TIMEOUT.

Detailed HTTP request:

GET wss://<<MY_JHUB_DOMAIN>>/user/me/api/kernels/eaf397d3-36da-473c-8342-c4d4d3ad5256/channels?session_id=fa79dc80238648b8b1ea4c3982cb0612 HTTP/1.1
Host: <<MY_JHUB_DOMAIN>>
Connection: Upgrade
Pragma: no-cache
Cache-Control: no-cache
Upgrade: websocket
Origin: https://<<MY_JHUB_DOMAIN>>
Sec-WebSocket-Version: 13
User-Agent: redacted
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Cookie: redacted
Sec-WebSocket-Key: redacted
Sec-WebSocket-Extensions: permessage-deflate; client_max_window_bits

Detailed HTTP response:

HTTP/1.1 504 GATEWAY_TIMEOUT
Content-Length: 0
Connection: keep-alive


Solution

  • The issue was that the proxy-public ELB had been switched to listen on http instead of tcp, and this broke the kernel endpoint because it uses WebSockets: a classic ELB in http mode does not pass WebSocket upgrade requests through, while tcp mode does.

    Credit goes to the OP for figuring out their own issue.
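One way to keep the listener pinned to TCP, rather than fixing it by hand in the AWS console, is to declare the backend protocol on the proxy-public service via the standard AWS load-balancer annotation. A sketch assuming the zero-to-jupyterhub Helm chart with a release and namespace both named `jhub` (names are assumptions; the annotation key is the standard in-tree AWS one):

```shell
# Write a Helm values fragment forcing TCP backends on the ELB,
# so WebSocket upgrades pass through the classic load balancer.
cat > elb-config.yaml <<'EOF'
proxy:
  service:
    annotations:
      # Classic ELBs only proxy WebSockets with tcp listeners
      service.beta.kubernetes.io/aws-load-balancer-backend-protocol: tcp
EOF

# Apply it to the existing release ("jhub" names are assumptions)
helm upgrade jhub jupyterhub/jupyterhub \
  --namespace jhub \
  --values elb-config.yaml
```

Declaring it in the chart values means the setting survives future `helm upgrade` runs instead of silently reverting to the console-edited state.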