visual-studio-2010 profiler performancecounter cpu-cache

How far should one trust hardware counter profiling using VsPerfCmd.exe?


I'm attempting to use VsPerfCmd.exe to profile branch misprediction and last level cache misses in an instrumented native application.

The setup works as it says on the tin, but the results I'm getting don't seem sensible. For instance, a function that always touches a data set of 24MB is reported to cause only ~700 cache misses over ~2000 calls. To put this into perspective: the function linearly traverses two arrays of 1024*1024 elements, each element being 12 bytes in size. For every element, it randomly decides whether it needs information from the element 1024 indices before or after it. That means that in order to not generate any cache misses, the CPU would always have to keep at least three sections of 1024*12 bytes from each of the two arrays in cache. Furthermore, after every iteration the process yields the CPU using sleep() for about 8 milliseconds. I can't imagine any hardware prefetcher doing that good a job.
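To make the numbers above concrete, here is a rough back-of-the-envelope sketch in Python. The array sizes, the 8MB L3, and the reported counts come from the question; the 64-byte cache line is an assumption (typical for x86, not stated above). It counts cache lines that have to come from memory, which is not necessarily what a particular counter event reports, since prefetched lines may or may not be counted as misses.

```python
# Back-of-the-envelope estimate. Values are from the question unless
# marked as an assumption.
ELEMENTS_PER_ARRAY = 1024 * 1024
ELEMENT_SIZE = 12              # bytes per element
ARRAYS = 2
L3_SIZE = 8 * 1024 * 1024      # bytes of shared L3
CALLS = 2000                   # approximate number of profiled calls
REPORTED_MISSES = 700          # approximate LLC misses reported by VsPerfCmd
CACHE_LINE = 64                # bytes; ASSUMPTION, typical for x86

# Total data touched by one call, and the cache lines that covers.
working_set = ARRAYS * ELEMENTS_PER_ARRAY * ELEMENT_SIZE
lines_per_call = working_set // CACHE_LINE

# Most generous case: the L3 is still completely filled with useful
# lines from the previous call, so only the remainder is fetched.
lines_resident = L3_SIZE // CACHE_LINE
cold_lines_per_call = lines_per_call - lines_resident

# The +/-1024 element window that must stay cached: three sections of
# 1024 elements from each of the two arrays.
window_bytes = 3 * 1024 * ELEMENT_SIZE * ARRAYS

print(working_set / 2**20)        # 24.0 (MiB)
print(lines_per_call)             # 393216
print(cold_lines_per_call)        # 262144
print(window_bytes / 1024)        # 72.0 (KiB)
print(REPORTED_MISSES / CALLS)    # 0.35 misses per call
```

Under these assumptions, even the most generous case leaves a few hundred thousand lines per call that cannot already be in L3, against a reported average of well under one miss per call.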

How would this silly amount of data not generate more last level cache misses than VsPerfCmd says? Even though my i7 has 8MB of shared L3 cache, this seems highly unlikely. Can anyone share their opinions on what might be going on here? Of course "VsPerfCmd.exe sucks" would be a valid answer but if someone is going to say that, I'd like to at least hear of a similar experience someone had as a basis for this assertion.


Solution

  • Answering my own question: after trying to verify the VsPerfCmd results using Intel VTune Amplifier XE™ (this is no advertising, I just like typing out product names like that because it amuses me how silly they can be), I can definitely say that the VsPerfCmd numbers are garbage.

    This is only a rough comparison, as I haven't found out how to get the number of times a function was called from VTune, but approximately 900 calls resulted in 1,040,000 last level cache misses according to VTune. Contrasting that with the ~2000 calls profiled with VsPerfCmd and the ~700 LLC misses it reported, it's safe to assume that the VTune results are much more reasonable.

    Of course I can't say anything more specific than "VsPerfCmd was very likely wrong". The whys and hows of this phenomenon remain unclear. Should anyone who knows more feel an urge to elaborate on this, shoot me a comment!
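To put the two measurements in the answer side by side, the following sketch normalizes both to misses per call. All four figures are the approximate ones quoted above (the VTune call count in particular is an estimate), so the result is an order-of-magnitude indication only.

```python
# Approximate figures quoted in the answer above.
vtune_calls = 900
vtune_misses = 1_040_000
vsperf_calls = 2000
vsperf_misses = 700

# Normalize both tools to LLC misses per call of the function.
vtune_per_call = vtune_misses / vtune_calls
vsperf_per_call = vsperf_misses / vsperf_calls

# How many times larger the VTune figure is per call.
ratio = vtune_per_call / vsperf_per_call

print(round(vtune_per_call))   # 1156
print(vsperf_per_call)         # 0.35
print(round(ratio))            # 3302
```

A gap of more than three orders of magnitude per call is far larger than the uncertainty in the call counts could explain.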