server crash ubuntu-20.04

Ubuntu crashing - How to diagnose


I have a dedicated server running Ubuntu 20.04 with cPanel 106.11, MySQL 8, PHP 8.1 and Elasticsearch 7.17.8, and I run Magento 2.4.5-p1 on it. ConfigServer Security & Firewall is enabled. Every couple of days I get a monitoring alert saying the server doesn't respond to ping, and the host has to do a hard reboot. They are getting frustrated and say they will turn off monitoring unless I sort this out, as they have checked all the hardware and found it fine. This happens at different times, usually overnight.

I have looked through syslog, the MySQL log, the Elasticsearch log, the Magento 2 logs, the Apache log and kern.log, and I can't find the cause of the issue. I have enabled "sar"; around the time of the crashes, RAM usage is 64% and CPU usage is between 5% and 10%.

What else can I look at to try to diagnose this issue?

Additional info requested by Wilson:

select count - https://justpaste.it/6zc95   
show global status - https://justpaste.it/6vqvg   
show global variables - https://justpaste.it/cb52m   
full process list - https://justpaste.it/d41lt   
status - https://justpaste.it/9ht1i   
show engine innodb status - https://justpaste.it/a9uem   
top -b -n 1 - https://justpaste.it/4zdbx   
top -b -n 1 -H - https://justpaste.it/bqt57   
ulimit -a - https://justpaste.it/5sjr4   
iostat -xm 5 3 - https://justpaste.it/c37to   
df -h, df -i, free -h and cat /proc/meminfo - https://justpaste.it/csmwh
htop - https://freeimage.host/i/HAKG0va

The server uses NVMe drives and has 32GB RAM and 6 cores. MySQL runs on the same server as LiteSpeed.

The server has not gone down again since I posted this. When it does crash, the datacentre usually reboots it within 15-20 minutes, and 99% of the time it happens overnight. The server is not accessible over SSH when it crashes.


Solution

  • Rate Per Second = RPS, Rate Per Hour = RPHr

    Suggestions to consider for your instance (these should be available in your cPanel, as they are all dynamic variables):

    connect_timeout=30  # from 10 seconds to reduce aborted_connects RPHr of 75 
    innodb_io_capacity=900  # from 200 to use more of NVMe IOPS capacity
    thread_cache_size=36  # from 9 to reduce threads_created RPHr of 75
    read_rnd_buffer_size=32768  # from 256K to reduce handler_read_rnd_next RPS of 5,805
    read_buffer_size=524288  # from 128K to reduce handler_read_next RPS of 5,063
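
Because the variables listed above are dynamic, they can be changed on a running server without a restart. As a rough sketch only, the snippet below turns the suggested values into `SET` statements that could be pasted into a MySQL client session with sufficient privileges. The helper name `build_statements` and the `SUGGESTED` dictionary are hypothetical and exist only for this illustration. `SET GLOBAL` affects the running server only, while `SET PERSIST` (MySQL 8.0) also records the value so that it survives a restart.

```python
# Suggested values copied from the list above.
# SUGGESTED and build_statements are hypothetical names used for illustration.
SUGGESTED = {
    "connect_timeout": 30,
    "innodb_io_capacity": 900,
    "thread_cache_size": 36,
    "read_rnd_buffer_size": 32768,
    "read_buffer_size": 524288,
}


def build_statements(values, persist=False):
    """Return one SET statement per variable.

    SET GLOBAL changes the running server only; SET PERSIST (MySQL 8.0)
    also saves the value so it is reapplied after a restart.
    """
    keyword = "PERSIST" if persist else "GLOBAL"
    return [f"SET {keyword} {name} = {value};" for name, value in values.items()]


if __name__ == "__main__":
    # Print the statements so they can be reviewed before being run.
    for statement in build_statements(SUGGESTED):
        print(statement)
```

Changing one variable at a time, and watching the server for a day or so between changes, makes it easier to tell which setting had an effect.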
    

    Many more opportunities exist to improve the performance of your instance. Please view my profile for contact info; we are pushing the limits of the one question/one answer format intended for this platform.
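
The RPS and RPHr figures quoted in the comments above are not explained in the answer. A plausible reading, and it is only an assumption, is that each one is a `SHOW GLOBAL STATUS` counter divided by the server's `Uptime` value in seconds. The sketch below shows that arithmetic; the function names and the sample numbers are made up for illustration and are not taken from the linked status output.

```python
def rate_per_second(counter, uptime_seconds):
    """Average number of events per second since the server started."""
    if uptime_seconds <= 0:
        raise ValueError("uptime_seconds must be positive")
    return counter / uptime_seconds


def rate_per_hour(counter, uptime_seconds):
    """Average number of events per hour since the server started."""
    if uptime_seconds <= 0:
        raise ValueError("uptime_seconds must be positive")
    return counter * 3600 / uptime_seconds


if __name__ == "__main__":
    # Hypothetical inputs: one day of uptime and two invented counter values.
    uptime = 86400
    print(rate_per_second(172800, uptime))  # 2.0 events per second
    print(rate_per_hour(7200, uptime))      # 300.0 events per hour
```

Since these are averages over the whole uptime, they smooth out short bursts, so a counter that only spikes overnight would look much calmer here than it does at the time.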