nginx webserver pid ubuntu-20.04 systemctl

NGINX Process Stops: bind failed, kill failed, pid disappeared

My NGINX is doing weird things I don't understand:

Every day or even multiple times a day, the process just stops.

This is the error log file:

2022/04/15 09:49:23 [notice] 9327#9327: signal process started
2022/04/15 09:49:23 [alert] 9327#9327: kill(9311, 1) failed (3: No such process)
2022/04/15 09:49:23 [emerg] 9328#9328: bind() to 0.0.0.0:80 failed (98: Address already in use)
2022/04/15 09:49:23 [emerg] 9328#9328: bind() to [::]:80 failed (98: Address already in use)
2022/04/15 09:49:23 [emerg] 9328#9328: bind() to [::]:443 failed (98: Address already in use)
2022/04/15 09:49:23 [emerg] 9328#9328: bind() to 0.0.0.0:443 failed (98: Address already in use)
2022/04/15 09:49:23 [emerg] 9328#9328: bind() to 0.0.0.0:8888 failed (98: Address already in use)
2022/04/15 09:49:23 [emerg] 9328#9328: bind() to 0.0.0.0:80 failed (98: Address already in use)
2022/04/15 09:49:23 [emerg] 9328#9328: bind() to [::]:80 failed (98: Address already in use)
2022/04/15 09:49:23 [emerg] 9328#9328: bind() to [::]:443 failed (98: Address already in use)
2022/04/15 09:49:23 [emerg] 9328#9328: bind() to 0.0.0.0:443 failed (98: Address already in use)
2022/04/15 09:49:23 [emerg] 9328#9328: bind() to 0.0.0.0:8888 failed (98: Address already in use)
2022/04/15 09:49:23 [emerg] 9328#9328: bind() to 0.0.0.0:80 failed (98: Address already in use)
2022/04/15 09:49:23 [emerg] 9328#9328: bind() to [::]:80 failed (98: Address already in use)
2022/04/15 09:49:23 [emerg] 9328#9328: bind() to [::]:443 failed (98: Address already in use)
2022/04/15 09:49:23 [emerg] 9328#9328: bind() to 0.0.0.0:443 failed (98: Address already in use)
2022/04/15 09:49:23 [emerg] 9328#9328: bind() to 0.0.0.0:8888 failed (98: Address already in use)
2022/04/15 09:49:23 [emerg] 9328#9328: bind() to 0.0.0.0:80 failed (98: Address already in use)
2022/04/15 09:49:23 [emerg] 9328#9328: bind() to [::]:80 failed (98: Address already in use)
2022/04/15 09:49:23 [emerg] 9328#9328: bind() to [::]:443 failed (98: Address already in use)
2022/04/15 09:49:23 [emerg] 9328#9328: bind() to 0.0.0.0:443 failed (98: Address already in use)
2022/04/15 09:49:23 [emerg] 9328#9328: bind() to 0.0.0.0:8888 failed (98: Address already in use)
2022/04/15 09:49:23 [emerg] 9328#9328: bind() to 0.0.0.0:80 failed (98: Address already in use)
2022/04/15 09:49:23 [emerg] 9328#9328: bind() to [::]:80 failed (98: Address already in use)
2022/04/15 09:49:23 [emerg] 9328#9328: bind() to [::]:443 failed (98: Address already in use)
2022/04/15 09:49:23 [emerg] 9328#9328: bind() to 0.0.0.0:443 failed (98: Address already in use)
2022/04/15 09:49:23 [emerg] 9328#9328: bind() to 0.0.0.0:8888 failed (98: Address already in use)
2022/04/15 09:49:23 [emerg] 9328#9328: still could not bind()
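
To read a log like the one above at a glance, it can help to separate the two kinds of failure it contains. The following is a minimal Python sketch, not part of the original post; the function name `summarize` and the shortened `SAMPLE_LOG` are illustrative assumptions. It pulls out the failed signal (signal 1 is SIGHUP on Linux, sent here to a pid that no longer exists) and counts the failed bind() attempts per listen address.

```python
import re
from collections import Counter

# Lines copied from the log above, shortened to a single retry round.
SAMPLE_LOG = """\
2022/04/15 09:49:23 [notice] 9327#9327: signal process started
2022/04/15 09:49:23 [alert] 9327#9327: kill(9311, 1) failed (3: No such process)
2022/04/15 09:49:23 [emerg] 9328#9328: bind() to 0.0.0.0:80 failed (98: Address already in use)
2022/04/15 09:49:23 [emerg] 9328#9328: bind() to [::]:80 failed (98: Address already in use)
2022/04/15 09:49:23 [emerg] 9328#9328: still could not bind()
"""

KILL_RE = re.compile(r"kill\((\d+), (\d+)\) failed \((\d+): ([^)]*)\)")
BIND_RE = re.compile(r"bind\(\) to (\S+) failed \((\d+): ([^)]*)\)")


def summarize(log_text):
    """Return (failed_signals, bind_failures) found in an nginx error log.

    `summarize` is a hypothetical helper written for this example.
    """
    failed_signals = []        # (target pid, signal number, reason)
    bind_failures = Counter()  # listen address -> number of failed attempts
    for line in log_text.splitlines():
        m = KILL_RE.search(line)
        if m:
            failed_signals.append((int(m.group(1)), int(m.group(2)), m.group(4)))
            continue
        m = BIND_RE.search(line)
        if m:
            bind_failures[m.group(1)] += 1
    return failed_signals, bind_failures


signals, binds = summarize(SAMPLE_LOG)
print(signals)      # failed signal deliveries
print(dict(binds))  # failed bind() attempts per address
```

Read this way, the log shows two separate events: a reload signal aimed at a pid that is already gone, followed by a new nginx instance that cannot bind because something still holds the ports.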

Output of lsb_release -a:

Distributor ID: Ubuntu
Description:    Ubuntu 20.04.4 LTS
Release:    20.04
Codename:   focal

Also, at another point (while working with certbot), I noticed that the pid file had disappeared.

I think some service is restarting NGINX, and the new instance tries to bind ports that are still held by the old instance, which has not stopped yet. The new instance errors out, and then the old instance finally stops.

I checked that no other process is interfering and taking these ports, and I don't have apache2 installed. This leads me to believe the explanation above.

I can restart NGINX using systemctl restart nginx or using killall nginx; systemctl start nginx.

Interesting side note: sometimes systemctl status nginx shows the NGINX service as 'failed' even though NGINX is still running. I believe this is due to the missing pid file.

If you have any idea how I can debug or fix this, I'd be really thankful. This is not a state I can leave my webserver in. I'd be happy to provide any information or logs you might need.
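
One way to test the 'failed but still running' observation is to compare the nginx processes that actually hold the listening sockets with the processes systemd knows about. The following is a rough Python sketch under stated assumptions: `SAMPLE_SS` is invented output in the style of `ss -ltnp` (not taken from this server), and `listeners` and `unmanaged_nginx_pids` are hypothetical helper names. The set of systemd-known pids would have to be collected separately, for example from the CGroup section of systemctl status nginx.

```python
import re

# Invented example in the style of `ss -ltnp` output (assumption, not real data).
SAMPLE_SS = """\
State  Recv-Q Send-Q Local Address:Port Peer Address:Port Process
LISTEN 0      511          0.0.0.0:80        0.0.0.0:*     users:(("nginx",pid=9311,fd=6),("nginx",pid=9310,fd=6))
LISTEN 0      128          0.0.0.0:22        0.0.0.0:*     users:(("sshd",pid=812,fd=3))
"""

USER_RE = re.compile(r'\("([^"]+)",pid=(\d+),fd=\d+\)')


def listeners(ss_output):
    """Map each listening local address to the set of (name, pid) holding it."""
    result = {}
    for line in ss_output.splitlines():
        fields = line.split()
        if len(fields) < 5 or fields[0] != "LISTEN":
            continue  # skip the header and anything that is not a listener
        local_addr = fields[3]
        owners = {(name, int(pid)) for name, pid in USER_RE.findall(line)}
        result.setdefault(local_addr, set()).update(owners)
    return result


def unmanaged_nginx_pids(ss_output, systemd_pids):
    """Return nginx pids that hold a listening socket but are unknown to systemd."""
    held = {
        pid
        for owners in listeners(ss_output).values()
        for name, pid in owners
        if name == "nginx"
    }
    return held - set(systemd_pids)


# If systemd only knows about pid 9310, then 9311 is holding port 80 outside its control.
print(unmanaged_nginx_pids(SAMPLE_SS, {9310}))
```

A non-empty result would suggest that an nginx instance started outside systemd is holding the ports.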

Solution

I found the solution on the Let's Encrypt forum. Quoting the answer:

Your problem sounds familiar. One earlier case was this thread. It is long so I will summarize here.

Do you by chance have perl enabled? If so, try disabling it, then restart nginx and see whether that lets the renewal succeed.

Why? Using the nginx plug-in can result in a conflict with nginx. After the plug-in makes its temporary changes to your nginx conf, it reloads nginx using SIGHUP. That's fine, but if the reload fails, the plug-in starts nginx itself, without going through systemd. This creates an nginx instance that systemd cannot manage, and the two nginx instances fight each other for ports, leading to the symptom you saw.

Now, various things can cause the SIGHUP to fail. A common one is not having nginx running before doing the renewal. Of course, the SIGHUP will then fail. You said nginx was running, so this is likely not your cause.

I mention perl only because it explained the SEGV that made the nginx SIGHUP fail in the thread I linked to. We would have to dig through your system logs as we did in that thread. But if you have perl enabled, simply disabling it would be a quick test.
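
Whether a SIGHUP can reach nginx at all depends on the pid file pointing at a live process, so a check before renewing can rule out this cause. The following is a minimal Python sketch, not taken from the thread; `pid_file_status` is a hypothetical helper, and the pid file path must be supplied by the caller. It uses signal 0, which tests for the existence of a process without delivering anything.

```python
import os


def pid_file_status(path):
    """Classify a pid file as 'missing', 'invalid', 'stale', or 'alive'.

    Hypothetical helper written for this example.
    """
    try:
        with open(path) as fh:
            pid = int(fh.read().strip())
    except FileNotFoundError:
        return "missing"
    except ValueError:
        return "invalid"
    if pid <= 0:
        # 0 and negative values address process groups, not a single process.
        return "invalid"
    try:
        os.kill(pid, 0)  # signal 0: existence check only, nothing is delivered
    except ProcessLookupError:
        return "stale"   # the file names a process that no longer exists
    except PermissionError:
        return "alive"   # the process exists but belongs to another user
    return "alive"
```

A call such as pid_file_status("/run/nginx.pid") would report the state before a renewal; that path is an assumption (check the pid directive in your nginx configuration). A result of 'stale' corresponds to the kill(...) failed (3: No such process) line in the question's log, and 'missing' to the disappearing pid file.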

A work-around is to use webroot as that avoids the nginx plug-in altogether. Webroot uses your running nginx as it is.

Source: Let's Encrypt Forum: https://community.letsencrypt.org/t/auto-renewal-nginx-pid-disappears-nginx-doest-restart/179794

Solution by MikeMcQ