php nginx debian opcode-cache

Nginx php-fpm clogs up with writing connections under high load


We have nginx/1.6.2 running with php5-fpm (PHP 5.6) on a Debian 8 system.

Over the past few days we have had higher load than usual because more users are hitting our servers, with most visitors arriving in the evening between 6pm and midnight.

For the past couple of days, two different servers running the above setup have shown very slow response times for several hours at a time. In Munin, we saw that there were suddenly hundreds of nginx connections in the "writing" state where there had previously been only about 20 at a time.
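For anyone who wants to watch the same counter outside Munin: the "writing" figure is normally taken from nginx's stub_status page. The sketch below parses that page's text format. It assumes the stub_status module is enabled on the server; the sample numbers are purely illustrative and are not taken from the servers described here, and `parse_stub_status` is a hypothetical helper name.

```python
import re

# Illustrative stub_status output; the numbers are made up for the example.
SAMPLE = """Active connections: 291
server accepts handled requests
 16630948 16630948 31070465
Reading: 6 Writing: 179 Waiting: 106
"""


def parse_stub_status(text: str) -> dict:
    """Extract the counters from the text served by nginx's stub_status."""
    active = re.search(r"Active connections:\s+(\d+)", text)
    totals = re.search(
        r"^[ \t]*(\d+)[ \t]+(\d+)[ \t]+(\d+)[ \t]*$", text, re.MULTILINE
    )
    states = re.search(
        r"Reading:\s+(\d+)\s+Writing:\s+(\d+)\s+Waiting:\s+(\d+)", text
    )
    if not (active and totals and states):
        raise ValueError("text does not look like stub_status output")
    return {
        "active": int(active.group(1)),
        "accepts": int(totals.group(1)),
        "handled": int(totals.group(2)),
        "requests": int(totals.group(3)),
        "reading": int(states.group(1)),
        "writing": int(states.group(2)),
        "waiting": int(states.group(3)),
    }


if __name__ == "__main__":
    status = parse_stub_status(SAMPLE)
    print("writing connections:", status["writing"])
```

Polling this every few seconds and logging the `writing` value alongside the request counter gives a finer-grained view than a five-minute graph.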

We do not get any errors other than timed-out connections on remote hosts trying to access those servers. All the logs I looked at appeared normal.

The problem can be fixed with a restart of php5-fpm.

My question now is: why do hundreds of connections suddenly claim to be writing? Is there a known issue, or a config setting we missed, that could cause this?

Here is the complete list of symptoms we see:

  • Instead of fewer than 20 short-lived active connections at a time, we see between 100 and 900 connections in the writing state. All nginx connections hit php5-fpm; these servers do not serve static content. The average runtime of the PHP scripts is 80 ms.
  • The problem occurs only when the total nginx request rate goes above 300 req/s. Throughput then drops from ~350 to ~250 req/s, but those 250 req/s show up as up to 900 "writing" connections.
  • Many of these connections eventually time out and return no valid result.
  • There are no errors in our logs.
  • The network (eth) and database traffic, as well as the CPU load, correspond to the lower level of 250 req/s that the total drops to, so as far as I can tell no extra "writing" is actually happening.
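The figures in this list can be sanity-checked with Little's law (in-flight requests = arrival rate × time per request). The sketch below is only a rough estimate: it assumes a steady state and that each "writing" connection corresponds to one request still waiting on php5-fpm. The function names are made up for the example.

```python
def concurrency(rate_per_s: float, latency_s: float) -> float:
    """Little's law: average number of requests in flight."""
    return rate_per_s * latency_s


def implied_latency(in_flight: float, rate_per_s: float) -> float:
    """Little's law rearranged: average time each request spends in flight."""
    if rate_per_s <= 0:
        raise ValueError("rate_per_s must be positive")
    return in_flight / rate_per_s


if __name__ == "__main__":
    # Healthy case: about 300 req/s at 80 ms per script run.
    print(f"healthy in-flight: {concurrency(300, 0.080):.0f}")
    # Degraded case: 250 req/s with 100 to 900 connections in "writing".
    print(f"latency at 100 writing: {implied_latency(100, 250):.1f} s")
    print(f"latency at 900 writing: {implied_latency(900, 250):.1f} s")
```

Under these assumptions, the healthy numbers work out to roughly 24 requests in flight, which matches the "about 20" observed. With 100 to 900 writing connections at 250 req/s, each request would be taking somewhere between 0.4 s and 3.6 s instead of 80 ms. That points to requests being stalled inside PHP rather than nginx genuinely writing more data.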

The setup is as stated above. We use PHP's built-in Zend opcode cache and APCu as a user variable cache. One of the servers runs a memcache instance (which works fine throughout the problem) and the other runs Redis, which also keeps running fine while the problem occurs.

Can anyone shed some light on what the problem might be?

Thanks!


Solution

  • We found the problem: APCu seems to be unstable with PHP 5.6.

    Details:

    • Debian 8
    • nginx/1.6.2
    • PHP 5.6.14-0+deb8u1
    • APCu 4.0.7 (Revision: 328290, 126M shm_size)

    We used xhprof to profile requests while the server was slow (see question) and noticed that APCu took > 100 ms per read/write operation. Clearing the APCu variables did not help. All other parts of the code ran at normal speed.

    We completely disabled our use of APCu and the system has been stable since.

    So it seems that this APCu version is unstable under load with PHP 5.6, at least for us.