Tags: apc, php-5.4, opcache, mysqlnd, mysql-error-2006

Possible reasons for "MySQL server has gone away" error (PHP 5.4, mysqlnd)


We recently upgraded one of our webservers from PHP 5.3 (Debian Squeeze package, using libmysqlclient and APC) to PHP 5.4 (Debian Wheezy, Dotdeb package, using mysqlnd, OPcache and APCu). After working fine for almost one day, we experienced "mysql server has gone away" errors for every request. All other servers with the same load, which still run PHP 5.3 with libmysqlclient against the same MySQL server, had no problems at all. On all servers we use:

max_execution_time = 60
default_socket_timeout = 60

On our PHP 5.3 servers we did not change any MySQL/my.cnf timeouts. We know about the problems with read_timeout (MySQL), wait_timeout (MySQL), default_socket_timeout (PHP) and max_execution_time (PHP), but only in the context of batch scripts with long-running queries. Our webservers usually respond in about 300 ms, so those timeouts should not be an issue here.
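
For reference, the timeouts on both sides can be compared with a small script along these lines (just a sketch; host and credentials are placeholders):

<?php
// Sketch: print the PHP-side timeouts next to the MySQL session timeouts,
// so mismatches (e.g. wait_timeout shorter than max_execution_time) stand out.
// Host and credentials are placeholders.
$mysqli = new mysqli('db-host', 'user', 'pass');

echo 'php max_execution_time:       ' . ini_get('max_execution_time') . "\n";
echo 'php default_socket_timeout:   ' . ini_get('default_socket_timeout') . "\n";
echo 'php mysqlnd.net_read_timeout: ' . ini_get('mysqlnd.net_read_timeout') . "\n";

$res = $mysqli->query(
    "SHOW SESSION VARIABLES
     WHERE Variable_name IN ('wait_timeout', 'net_read_timeout', 'net_write_timeout')"
);
while ($row = $res->fetch_assoc()) {
    echo 'mysql ' . $row['Variable_name'] . ': ' . $row['Value'] . "\n";
}
$mysqli->close();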

It became really strange when we removed the server from load balancing, so there was no load anymore, but we still had 180 busy Apache processes. Even an apache2ctl graceful did not change anything; even hours later apache2ctl status said:

                       Apache Server Status for localhost

   Server Version: Apache/2.2.22 (Debian)
   Server Built: Jun 16 2014 03:51:14
     __________________________________________________________________

   Current Time: Tuesday, 22-Jul-2014 10:17:44 CEST
   Restart Time: Monday, 21-Jul-2014 18:43:37 CEST
   Parent Server Generation: 26
   Server uptime: 15 hours 34 minutes 6 seconds
   Total accesses: 596973 - Total Traffic: 1.6 GB
   CPU Usage: u6288.72 s463.96 cu.01 cs0 - 12% CPU load
   10.7 requests/sec - 30.8 kB/second - 2962 B/request
   176 requests currently being processed, 99 idle workers

GGGGGG_GGGGGGGGG_GG_GGGGGGGGGGGGGGGGGGGG_GGGGGG_GGGGGGGG_GGGGGGG
GGGGGGGGGG_G_GGGGGGGG_G_GG__GGGGGG_GGGGG_GGG___GG_GGGGGGGG_G_GGG
GGGGGGGGGGGG_G_GG__GG_GGG_GGGGGGGGG__GGG_GGG_G_G_GG_G_GGGGGGGGGG
GGG_GGG_GG_GGG_GG_G_GGG_______________.___._W___________________
____.___________.______.........................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
....

   Scoreboard Key:
   "_" Waiting for Connection, "S" Starting up, "R" Reading Request,
   "W" Sending Reply, "K" Keepalive (read), "D" DNS Lookup,
   "C" Closing connection, "L" Logging, "G" Gracefully finishing,
   "I" Idle cleanup of worker, "." Open slot with no current process

Only an apache2ctl restart solved the issue, and everything worked fine again. The MySQL error is the only "useful" error message we have found so far.

Could it be an issue with mysqlnd, OPcache or APCu and PHP 5.4.30? Are there any known problems which could result in the behavior we experienced?

Or do you have an idea how to debug the "mysql server has gone away" issue?


Solution

  • We probably found out why we get the "mysql server has gone away" error: on the MySQL server we have configured a wait_timeout of 30 seconds, which is less than the 60-second max_execution_time. So under certain conditions something seems to take more than 30 seconds while we are reading a result set from MySQL, and the server closes the connection while we are still trying to fetch data from it. That leads us to the next questions:

    1. What function consumes so much time while we are in a loop reading a result set from MySQL?

    2. Why does apache2ctl graceful not restart the Apache processes, even though max_execution_time should abort the scripts after 60 seconds?

    I think the answer to both questions is a bug in APCu: if I look at the hanging Apache children, strace shows them stuck in FUTEX_WAIT:

    [pid 28354] futex(0x7f3a8c3d2094, FUTEX_WAIT, 69, NULL <unfinished ...>
    

    If I look at such a process with gdb, it seems to hang in pthread_rwlock_wrlock(); what I get is:

    0x00007f3adcd18abd in pthread_rwlock_wrlock () from /lib/x86_64-linux-gnu/libpthread.so.0
    

    OK, pthread_rwlock is used for locking in APCu, and a problem with the locking mechanism is a really good explanation for what we see here: we definitely have code which reads from and writes to APCu inside loops over MySQL result sets, and if there is a problem with locking (which has already been a problem for us in the past with APC), it could block for more than 30 but less than 60 seconds, so the MySQL error is exactly what we would see. And after that situation something in APCu goes really wrong, so the PHP script can no longer be aborted by max_execution_time and the processes can no longer be restarted by apache2ctl graceful.
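
    To make the pattern concrete, here is a minimal sketch of what such code looks like (not our production code; table name, credentials and the simulated stall are made up):

    <?php
    // Minimal sketch of the pattern described above (placeholder table/credentials).
    $mysqli = new mysqli('localhost', 'user', 'pass', 'db');

    // MYSQLI_USE_RESULT streams rows from the server while we iterate,
    // so the connection stays busy for the whole loop and wait_timeout applies.
    $result = $mysqli->query('SELECT id, payload FROM big_table', MYSQLI_USE_RESULT);

    while ($row = $result->fetch_assoc()) {
        $key = 'row_' . $row['id'];
        if (apcu_fetch($key) === false) {
            // If another process holds the APCu write lock, apcu_store() can stall here.
            // A stall longer than wait_timeout (30s in our setup) closes the connection,
            // and the next fetch_assoc() fails with error 2006 "server has gone away".
            // sleep(35); // uncomment to simulate such a stall
            apcu_store($key, $row['payload'], 300);
        }
    }
    $result->free();
    $mysqli->close();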

    In the APCu issue tracker I could find a very similar issue: https://github.com/krakjoe/apcu/issues/19

    But we found another hint: the crash always happens when there are about 70k keys in APCu, and it does not depend on apc.shm_size. We also found that our APCu monitoring script produces "PHP Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 78 bytes)" errors when calling apcu_cache_info() at line 47, at the same time we see the crash. So we have to look into why the script consumes so much memory; as far as I remember, it reads all the entry data to calculate memory fragmentation, so perhaps we should remove that part...
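
    Until we rework the script, one idea is to ask APCu only for summary data; a rough sketch (not our actual monitoring script), assuming the $limited flag of apcu_cache_info() and apcu_sma_info() behaves as documented:

    <?php
    // Sketch: with $limited = true, apcu_cache_info() omits the per-entry
    // cache_list and apcu_sma_info() omits the block lists, so neither call
    // has to copy ~70k cache entries into PHP memory just for monitoring.
    $info = apcu_cache_info(true); // summary only (counters, entry count, mem_size, ...)
    $sma  = apcu_sma_info(true);   // summary only (num_seg, seg_size, avail_mem)

    print_r($info);

    // Rough occupancy figure from the shared-memory summary alone:
    $total = $sma['num_seg'] * $sma['seg_size'];
    printf("shm used: %.1f MB of %.1f MB\n",
        ($total - $sma['avail_mem']) / 1048576,
        $total / 1048576);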

    But we have had a lot of problems with APC in the past; we only switched to APCu/OPcache because we got segfaults with the latest APC and PHP 5.4.30, and the issue mentioned above has been open for a year now. We are happy to see recent activity on yac; perhaps a lockless cache is a more stable option. If we can't fix this by removing the problematic parts of our monitoring script, we will switch to local memcached instances; that will be slower, but we know it's very stable.
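
    If it comes to that, a thin wrapper along these lines (just a sketch; the class name and defaults are made up) would let us swap APCu for a local memcached instance without touching the code that loops over result sets:

    <?php
    // Sketch of a thin cache wrapper (made-up class name) so the backend can be
    // switched between APCu and a local memcached instance in one place.
    class LocalCache
    {
        private $memcached = null;

        public function __construct($useMemcached = false)
        {
            if ($useMemcached) {
                $this->memcached = new Memcached();
                $this->memcached->addServer('127.0.0.1', 11211); // local instance
            }
        }

        public function get($key)
        {
            return $this->memcached ? $this->memcached->get($key) : apcu_fetch($key);
        }

        public function set($key, $value, $ttl = 300)
        {
            return $this->memcached ? $this->memcached->set($key, $value, $ttl)
                                    : apcu_store($key, $value, $ttl);
        }
    }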