I have found that one common reason for the error is an exception being thrown from within an exception handler. I'm quite sure this doesn't happen in the application I'm trying to debug... But I've put all the initialization processing lines at the top of index.php in a try/catch.*
It can apparently also happen because some things cannot be serialized to be stored in a session. At most this application stores arrays into the session (quite a bit), but I'm confident that it doesn't store anything too out of the ordinary in it.
Someone commented that it happened to them because their primary key needed to be CHAR(32) instead of INT(11). The PK's in this app are all INTs.
Other suggestions are that it could be a problem with PHP 5.3.3 fixed in 5.3.6, full disk, and a need to typecast a SimpleXML value. We do happen to be running PHP 5.3.3, but upgrading would have to be a last resort in this case. It hasn't always been doing this.
UPDATE/NOTE: I actually can't reproduce the error myself, only see it happening in the logs, see below paragraph for where I believe the error is happening...
* From the error logs, it seems likely that at least one place it is happening is index.php. I am deducing this only because it is indicated in some entries by a referring URL. The try/catch code is currently only around the "top" initialization portion of the script, below that is mostly the HTML output. There is some PHP code in the output (pretty straightforward stuff though), so I may need to test that. Here is the catch part, which is not producing any output in the logs:
} catch (Exception $e) {
error_log(get_class($e)." thrown. Message: ".$e->getMessage(). " in " . $e->getFile() . " on line ".$e->getLine());
error_log('Exception trace stack: ' . print_r($e->getTrace(),1));
}
Would really appreciate any tips on this!
EDIT: PHP is running as an Apache module (Server API: Apache 2.0 Handler). I don't think there are any PHP accelerators in use, but it could just be that I don't know how to tell. None of the ones listed on Wikipedia are in phpinfo().
As far as I can tell the MPM is prefork. This is the first I'd ever looked into the MPM:
# ./httpd -l
Compiled in modules:
core.c
prefork.c
http_core.c
mod_so.c
In short you have a exception thrown somewhere, you have no idea where and up until now you could not reproduce the error: It only happens for some people, but not for you. You know that it happens for other people, because you see that in the error logs.
Since you have already eliminated the common reasons you will need to reproduce the error. If you know which parameter will cause the error it should be easy to locate the error.
Let's start simple: To get all parameters you can write in the very beginning of the affected php file something like:
file_put_contents("/path/to/some/custom_error_log", date()."\n".print_r(get_defined_vars(), true), FILE_APPEND | LOCK_EX);
Don't forget that the custom_error_log file must be writable to your php application. Then, when the error occurs in the error log, find the corresponding lines in your custom_error_log file. Hopefully there are not to many requests per second so that you can still identify the request. Maybe some additional parameters in the error log like source ip can help you identify the request (if your error log shows that). From that data, reconstruct a request with the same POST/GET parameters.
The next option that is very simple as well, but requires you to have root-access on your target machine is to install tcpflow. Then create a folder, cd into that folder and simply execute (as root) tcpflow "port 80"
. The option (port 80) is a pcap filter expression. To see all you can do with that, see man pcap-filter
. There is a lot what these filter expressions can do.
Now tcpflow will record all tcp connections on port 80, reconstruct the full data exchange by combining the packages belonging to one connection and dump this data to a file, creating two new files per connection, one for incoming data and one for outgoing data. Now find the files for a connection that caused an error, again based on the timestamp in your error log and by the last modified timestamp of the files. Then you get the full http request headers. You can now reconstruct the HTTP request completely, including setting the same accept-encoding, user-agent, etc. You can even pipe the request directly into netcat, replaying the exact request. Beware though that some arguments like a sessionid might be in your way. If php discovers that a session is expired you may just get a redirect to a login or something else that is unexpected. You may need to exchange things like the session id.
If none of this helps and you can't reproduce the error on your machine, then you can try to mock everything that is hard to mock. For example the source ip adress. This might make some stunts necessary, but it is possible: You can connect to your server using ssh with the "-w" option, creating a tunnel interface. Then assign the offending ip adress to your own machine and set routes (route add host ) rules to use the tunnel for the specific ip. If you can cable the two computers directly together then you can even do it without the tunnel.
Don't foget to mock the session which should be esiest. You can read all session variables using the method with print_r(get_defined_vars()). Then you need to create a session with exactly the same variables.
Another option would be actually ask the user what he was doing. Maybe you can follow the same steps as he and can reproduce.
If none of that helps... well... Then it gets seriously difficult. The IP-thing is already highly unlikely. It could be a GEO-IP library that causes the error on IPs from a specific region, but these are all rather unlikely things. If none of the above helped you to reproduce the problem, then you probably just did not find the correct request in all the data generated by the custom_log_file-call / tcpflow. Try to increase your chances by getting a more accurate timestamp. You can use microtime() in php as a replacement for date(). Check your webserver, if you can get something more accurate than seconds in your error log. Write your own implementation of "tail", that gives you a more accurate timestamp,... Reduce the load on the system, so that you don't have to choose from that much data (try another time of day, load of users to different servers,...)
Now once you can reproduce it should be a walk in the park to find the actual cause. You can find the parameter that causes the error by trial and error or by comparing it to other requests that caused an error, too, looking for similarities. And then you can see what this parameter does, which libraries access it, etc. You can disable every component one by one that uses the parameter until you can't reproduce anymore. Then you got your component and can dive into the problem deeper.
Tell us what you found. I am curious ;-).