Tags: out-of-memory, amazon-ami, kong, amazon-eks, ulimit

Kong k8s deployment fails after seemingly innocent EKS worker AMI upgrade


After upgrading the AWS EKS worker AMI to a new version, our Kong deployment on Kubernetes fails.
Kong version: 1.4
Old AMI version: amazon-eks-node-1.14-v20200423
New AMI version: amazon-eks-node-1.14-v20200723
Kubernetes version: 1.14

I see that the new AMI comes with a new Docker version, 19.03.06, while the old one ships with 18.09.09. Could this cause the issue?
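
One way to narrow that down is to compare what Docker on an old node and on a new node hands to containers by default. This is only a diagnostic sketch and assumes shell access to one worker node of each AMI version:

# Run on an old (v20200423) node and on a new (v20200723) node and compare.
docker version --format '{{.Server.Version}}'

# Default open-file limit handed to a freshly started container:
docker run --rm busybox sh -c 'ulimit -n'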

In the Kong pod logs I can see a lot of workers exiting on signal 9:

2020/08/11 09:00:48 [notice] 1#0: using the "epoll" event method
2020/08/11 09:00:48 [notice] 1#0: openresty/1.15.8.2
2020/08/11 09:00:48 [notice] 1#0: built by gcc 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC) 
2020/08/11 09:00:48 [notice] 1#0: OS: Linux 4.14.181-140.257.amzn2.x86_64
2020/08/11 09:00:48 [notice] 1#0: getrlimit(RLIMIT_NOFILE): 1048576:1048576
2020/08/11 09:00:48 [notice] 1#0: start worker processes
2020/08/11 09:00:48 [notice] 1#0: start worker process 38
2020/08/11 09:00:48 [notice] 1#0: start worker process 39
2020/08/11 09:00:48 [notice] 1#0: start worker process 40
2020/08/11 09:00:48 [notice] 1#0: start worker process 41
2020/08/11 09:00:50 [notice] 1#0: signal 17 (SIGCHLD) received from 40
2020/08/11 09:00:50 [alert] 1#0: worker process 40 exited on signal 9
2020/08/11 09:00:50 [notice] 1#0: start worker process 42
2020/08/11 09:00:51 [notice] 1#0: signal 17 (SIGCHLD) received from 39
2020/08/11 09:00:51 [alert] 1#0: worker process 39 exited on signal 9
2020/08/11 09:00:51 [notice] 1#0: start worker process 43
2020/08/11 09:00:52 [notice] 1#0: signal 17 (SIGCHLD) received from 41
2020/08/11 09:00:52 [alert] 1#0: worker process 41 exited on signal 9
2020/08/11 09:00:52 [notice] 1#0: signal 29 (SIGIO) received
2020/08/11 09:00:52 [notice] 1#0: start worker process 44
2020/08/11 09:00:48 [debug] 38#0: *1 [lua] globalpatches.lua:243: randomseed(): seeding PRNG from OpenSSL RAND_bytes()
2020/08/11 09:00:48 [debug] 38#0: *1 [lua] globalpatches.lua:269: randomseed(): random seed: 255136921215 for worker nb 0
2020/08/11 09:00:48 [debug] 38#0: *1 [lua] events.lua:211: do_event_json(): worker-events: handling event; source=resty-worker-events, event=started, pid=38, data=nil
2020/08/11 09:00:48 [notice] 38#0: *1 [lua] cache_warmup.lua:42: cache_warmup_single_entity(): Preloading 'services' into the cache ..., context: init_worker_by_lua*
2020/08/11 09:00:48 [warn] 38#0: *1 [lua] socket.lua:159: tcp(): no support for cosockets in this context, falling back to LuaSocket, context: init_worker_by_lua*
2020/08/11 09:00:53 [notice] 1#0: signal 17 (SIGCHLD) received from 38
2020/08/11 09:00:53 [alert] 1#0: worker process 38 exited on signal 9
2020/08/11 09:00:53 [notice] 1#0: start worker process 45
2020/08/11 09:00:54 [notice] 1#0: signal 17 (SIGCHLD) received from 42
2020/08/11 09:00:54 [alert] 1#0: worker process 42 exited on signal 9
2020/08/11 09:00:54 [notice] 1#0: signal 29 (SIGIO) received
2020/08/11 09:00:54 [notice] 1#0: start worker process 46
2020/08/11 09:00:55 [notice] 1#0: signal 29 (SIGIO) received
2020/08/11 09:00:55 [notice] 1#0: signal 17 (SIGCHLD) received from 43
2020/08/11 09:00:55 [alert] 1#0: worker process 43 exited on signal 9
2020/08/11 09:00:55 [notice] 1#0: start worker process 47
2020/08/11 09:00:56 [notice] 1#0: signal 17 (SIGCHLD) received from 44
2020/08/11 09:00:56 [alert] 1#0: worker process 44 exited on signal 9
2020/08/11 09:00:56 [notice] 1#0: signal 29 (SIGIO) received
2020/08/11 09:00:56 [notice] 1#0: start worker process 48
2020/08/11 09:00:56 [notice] 1#0: signal 17 (SIGCHLD) received from 45
2020/08/11 09:00:56 [alert] 1#0: worker process 45 exited on signal 9
2020/08/11 09:00:58 [notice] 1#0: signal 29 (SIGIO) received
2020/08/11 09:00:58 [notice] 1#0: start worker process 49
2020/08/11 09:00:59 [notice] 1#0: signal 17 (SIGCHLD) received from 46
2020/08/11 09:00:59 [alert] 1#0: worker process 46 exited on signal 9
2020/08/11 09:00:59 [notice] 1#0: signal 29 (SIGIO) received
2020/08/11 09:00:59 [notice] 1#0: start worker process 50
2020/08/11 09:00:59 [notice] 1#0: signal 17 (SIGCHLD) received from 47

The only critical message is:
[crit] 235#0: *45 [lua] balancer.lua:749: init(): failed loading initial list of upstreams: failed to get from node cache: could not acquire callback lock: timeout, context: ngx.timer

Looking at kubectl describe pod kong... I can see OOMKilled. Could this be a memory issue?
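
A quick way to confirm both the OOMKilled status and the file limit the container actually sees is sketched below; the pod name kong-xxxxx and the kong namespace are placeholders for whatever your deployment uses:

# Last termination reason of the Kong container (expected: OOMKilled).
kubectl get pod kong-xxxxx -n kong \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'

# Open-file limit visible inside the running container
# (should match the getrlimit(RLIMIT_NOFILE) line in the log above).
kubectl exec -n kong kong-xxxxx -- sh -c 'ulimit -n'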


Solution

  • The new node AMI's open-file ulimit (nofile) changed from 65536 to 1048576. This much larger limit caused memory issues with our current Kong setup, which is why the deployment failed.

    Changing the file limit on the new nodes back to the previous value fixed the Kong deployment.
    We decided instead to increase Kong's memory request, which also fixes the issue; both options are sketched below.

    Relevant GitHub issue
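
    A minimal sketch of both fixes, assuming a Deployment named kong in a kong namespace; the names and memory sizes below are placeholders, not values from this setup.

    Option 1: pin the Docker default nofile limit on the worker nodes back to 65536 by merging the following into /etc/docker/daemon.json (for example via the launch template user data, before the node joins the cluster) and then restarting Docker with systemctl restart docker. On EKS AMIs, merge this into the existing daemon.json rather than replacing it:

    {
      "default-ulimits": {
        "nofile": { "Name": "nofile", "Soft": 65536, "Hard": 65536 }
      }
    }

    Option 2 (what we ended up doing): give the Kong pods more memory instead.

    kubectl set resources deployment/kong -n kong \
      --requests=memory=512Mi --limits=memory=1Gi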