I'm using Intel VTune Amplifier XE 2013 to profile a parallel program running on a multicore CPU; specifically, it is written in OpenCL and executed on a Xeon Phi. I wonder how exactly the results reported by VTune should be interpreted, i.e.,
VTune samples all cores by default on Xeon Phi, and the results can be viewed in either of two ways: aggregated or per core. Use the Grouping drop-down box in the Bottom-up tab of the GUI to control how the data is aggregated, and use "Change Viewpoint" to switch between hotspots, event counts, and other available views.
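To make the difference between the two views concrete, here is a minimal Python sketch of what "aggregated" versus "per core" grouping amounts to. It is not VTune's API or output format: the record layout, the function names `aggregate_by_function` and `group_by_core`, and the sample counts are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical (core, function, samples) records; the numbers are invented
# for illustration and do not come from a real VTune run.
samples = [
    (0, "kernel_a", 120),
    (0, "kernel_b", 30),
    (1, "kernel_a", 80),
    (1, "kernel_b", 70),
]

def aggregate_by_function(records):
    """Sum samples per function across all cores (the aggregated view)."""
    totals = defaultdict(int)
    for _core, func, count in records:
        totals[func] += count
    return dict(totals)

def group_by_core(records):
    """Keep samples separated by core, then by function (the per-core view)."""
    per_core = defaultdict(dict)
    for core, func, count in records:
        per_core[core][func] = per_core[core].get(func, 0) + count
    return dict(per_core)

aggregated = aggregate_by_function(samples)
per_core = group_by_core(samples)
print(aggregated)
print(per_core)
```

The aggregated view tells you which function is hot overall, while the per-core view can reveal load imbalance that the totals hide.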
For more information on OpenCL analysis with VTune on Xeon Phi, please refer to the articles below: