I have a recurring problem when using perf with the Intel-PT event. I am profiling on an Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz machine (x86_64 architecture, 32 hardware threads, virtualization enabled), using programs from SpecCPU2006.
The first time I profile one of the compiled SpecCPU2006 binaries, everything works and the perf.data file is generated as expected. Because SpecCPU2006 programs are computationally intensive (they use 100% of the CPU the whole time), the perf.data files produced with Intel-PT are large: roughly 7-10 GB for most of the profiled programs.
However, when I profile the same compiled binary a second time, after the first run has completed successfully, my server freezes. Sometimes it freezes on the third or fourth run instead, after the earlier runs completed successfully, so the behavior is highly unpredictable. Once it happens, I cannot profile any more binaries until I restart the machine.
I have also posted the server error logs that I get once the computer has stopped responding.
The logs clearly contain the error message "Fixing recursive fault but reboot is needed!".
This happens in particular for the larger SpecCPU2006 binaries, those that take more than 1 minute to run without perf.
Is there any particular reason why this might happen? It should not be caused by high CPU usage, because running the programs without perf, or with perf and any other hardware event (as listed by perf list), completes successfully. It only seems to happen with Intel-PT.
Please guide me through the steps to solve this problem. Thanks.
It seems I have resolved this issue now, so I will post an answer.
The server crashed because of a NULL pointer dereference involving a specific member of the structure perf_event. The member perf_event->handle was the culprit. This information, as suggested by @osgx, was obtained from the /var/log/syslog file. A portion of the error message was:
Apr 19 04:49:15 ###### kernel: [582411.404677] BUG: unable to handle kernel NULL pointer dereference at 00000000000000ea
Apr 19 04:49:15 ###### kernel: [582411.404747] IP: [] perf_event_aux_event+0x2e/0xf0
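To find entries like the two above without reading the whole log by hand, a small script can help. The following is only a sketch: the function name find_null_derefs and both regular expressions are my own assumptions, written against the message format quoted above, and other kernel versions may word these lines differently.

```python
import re

# Patterns modelled on the two syslog lines quoted above (an assumption;
# other kernel versions may format these messages differently).
NULL_DEREF = re.compile(
    r"BUG: unable to handle kernel NULL pointer dereference at (\S+)"
)
FAULT_IP = re.compile(r"IP: \[.*?\] (\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def find_null_derefs(lines):
    """Return (address, symbol) pairs for each NULL dereference report.

    symbol is None when no matching "IP:" line follows the BUG line.
    """
    results = []
    pending = None  # address from a BUG line still waiting for its IP line
    for line in lines:
        match = NULL_DEREF.search(line)
        if match:
            if pending is not None:
                results.append((pending, None))
            pending = match.group(1)
            continue
        match = FAULT_IP.search(line)
        if match and pending is not None:
            results.append((pending, match.group(1)))
            pending = None
    if pending is not None:
        results.append((pending, None))
    return results
```

It can be applied to an open log file, for example find_null_derefs(open("/var/log/syslog")), which usually requires read permission on that file.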
One possible scenario in which this structure member turns out to be NULL is when I start capturing packets before an earlier run of perf record has finished releasing all of its resources. This case is handled properly in kernel version 4.10; I was using kernel version 4.4.
I upgraded my kernel to the newer version and it works fine now!
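For anyone who wants to check a machine before starting a long Intel-PT run, here is a minimal sketch that compares the running kernel release against 4.10, the version mentioned above. The helper name kernel_at_least is hypothetical, and the check only looks at the major and minor numbers, so it cannot tell whether a distribution has backported the fix to an older kernel.

```python
import platform
import re


def kernel_at_least(release, major, minor):
    """Return True if a release string such as "4.4.0-72-generic"
    is at least major.minor (only the first two numbers are compared)."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if not match:
        raise ValueError("unrecognised kernel release: %r" % release)
    return (int(match.group(1)), int(match.group(2))) >= (major, minor)


if __name__ == "__main__":
    # platform.release() reports the same string as `uname -r`.
    release = platform.release()
    if kernel_at_least(release, 4, 10):
        print(release, "is 4.10 or newer")
    else:
        print(release, "is older than 4.10")
```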