How do you compute the execution time of instructions? Is it just done by checking what the chip manufacturers say about how many clock cycles an operation takes to complete? Is there anything else I should know about this? It feels like I'm missing something.
The RDTSC instruction is extremely accurate as far as I know.
I think if you are seeking exact cycle counts, then for short sections you may run into the issues of simultaneity that Mysticial mentioned.
But if ultra-ultra-precision is not a requirement, that is, if you can live with a result that is off by, say, 9 to 80 cycles in certain scenarios, then I'm pretty sure you can still get very accurate results with RDTSC, especially when you consider that 9 to 80 cycles out of 3.2 billion per second is a very tiny slice of time :)
The numbers 9 and 80 were chosen a bit arbitrarily (and you may not be on a 3.2 GHz CPU either), since I don't know exactly what the error is, but I'm pretty sure it's in that ballpark :)
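To put that ballpark into seconds, here is a minimal sketch of the arithmetic. The function name cycles_to_seconds is made up for illustration, and the 3.2 GHz figure is only the assumed clock speed from above, not a measured one.

```c
#include <stdint.h>

/* Convert a cycle count to seconds, given an assumed counter
   frequency in Hz (for example 3.2e9 for a 3.2 GHz clock). */
double cycles_to_seconds(uint64_t cycles, double hz)
{
    return (double)cycles / hz;
}
```

At the assumed 3.2 GHz, cycles_to_seconds(9, 3.2e9) is roughly 2.8 nanoseconds and cycles_to_seconds(80, 3.2e9) is 25 nanoseconds.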
Here's the RDTSC excerpt of a timer function I use:
//High-Rez Setup
__asm
{
push eax
push edx
rdtsc
mov [AbsoluteLow],eax
mov [AbsoluteHigh],edx
pop edx
pop eax
}
Actually, I'll go ahead and post the whole thing. This code assumes that the type "double" is a 64-bit floating point number, which might not be a universal compiler / architecture assumption. It also assumes that "ulong" is a typedef for a 32-bit unsigned integer (unsigned long under MSVC):
double AbsoluteTime;
double AbsoluteResolution;
ulong AbsoluteLow;
ulong AbsoluteHigh;
void Get_AbsoluteTime (double *time)
{
//Variables
double current, constant;
double lower, upper;
ulong timelow, timehigh;
//Use the Intel RDTSC
__asm
{
push eax
push edx
rdtsc
sub eax, [AbsoluteLow]
sbb edx, [AbsoluteHigh]
mov [timelow], eax
mov [timehigh], edx
pop edx
pop eax
}
//Convert two 32bit registers to a 64-bit floating point
//Multiplying by 4294967296 is similar to left-shifting by 32 bits
constant = 4294967296.0;
lower = (double) timelow;
upper = (double) timehigh;
upper *= constant;
current = lower + upper;
current /= AbsoluteResolution;
current += AbsoluteTime;
*time = current;
}
void Set_AbsoluteTime (double time, double scale)
{
//Variables
double invScale;
//Setup
AbsoluteTime = time;
//High-Rez Setup
__asm
{
push eax
push edx
rdtsc
mov [AbsoluteLow],eax
mov [AbsoluteHigh],edx
pop edx
pop eax
}
//Fetch MHZ
if (1)
{
//Local Variables
int nv;
ulong mhz;
char keyname[2048];
//Default assumption of 3.2ghz if registry functions fail
mhz = 3200;
//Registry Key
sprintf (keyname, "HARDWARE\\DESCRIPTION\\System\\CentralProcessor\\0");
nv = Reg_Get_ValueDW (keyname, "~MHz", (ulong *)&mhz);
//Transform into cycles per second, computed in double so that
//clock speeds above 4294 MHz cannot overflow a 32-bit ulong
AbsoluteResolution = (double) mhz * 1000000.0;
//Apply the time scale
invScale = 1.0;
invScale /= scale;
AbsoluteResolution *= invScale;
}
}
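The conversion of the two 32-bit halves in Get_AbsoluteTime can also be done in integer arithmetic before going to floating point. This is only a sketch of that idea: the names combine_halves and combine_halves_fp are hypothetical, and it assumes a compiler that provides <stdint.h>.

```c
#include <stdint.h>

/* Join the EDX:EAX halves returned by RDTSC into one 64-bit count. */
uint64_t combine_halves(uint32_t low, uint32_t high)
{
    return ((uint64_t)high << 32) | low;
}

/* The same value via the floating-point route used in Get_AbsoluteTime:
   multiplying by 4294967296.0 (2^32) plays the role of the left shift. */
double combine_halves_fp(uint32_t low, uint32_t high)
{
    return (double)high * 4294967296.0 + (double)low;
}
```

Both routes agree as long as the count fits in the 53 bits of precision a 64-bit double has, which at 3.2 GHz is about a month of elapsed cycles. The integer version stays exact over the full 64 bits.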
Call Set_AbsoluteTime once somewhere before using the Get function. Without that initial call to Set, the Gets will return erroneous results, but once that one-time call is made you are good to go.
Here's an example:
void Function_to_Profile (void)
{
//Variables
double t1, t2, TimeElapsed;
//Profile operations
Get_AbsoluteTime (&t1);
//...do stuff here...
Get_AbsoluteTime (&t2);
//Calculate Elapsed Time
TimeElapsed = (t2 - t1);
//Feedback
printf ("This function took %.11f seconds to run\n", TimeElapsed);
}
int main (void)
{
Set_AbsoluteTime (0.000, 1.000);
Function_to_Profile();
return 0;
}
If for some reason you wanted time measurements to flow backwards at half speed (maybe handy for game programming), the initial call would be: Set_AbsoluteTime (0.000, -0.500);
The first parameter to Set is the base time that gets added to all results, and the second is the scale applied to elapsed time.
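To make the effect of the two parameters concrete, here is a small restatement of the arithmetic the Get/Set pair performs. The function scaled_time is hypothetical and takes the elapsed cycle count and clock frequency as plain arguments, so that it can be tried without RDTSC.

```c
/* Restates the math above: Set stores resolution = hz / scale,
   and Get returns base + elapsed_cycles / resolution. */
double scaled_time(double base, double scale, double elapsed_cycles, double hz)
{
    double resolution = hz / scale;   /* mirrors AbsoluteResolution */
    return base + elapsed_cycles / resolution;
}
```

With an assumed 3.2 GHz clock, one real second (3.2e9 cycles) comes out as 1.0 with scale 1.0, as -0.5 with scale -0.5, and as base + 1.0 when a nonzero base is given.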
I'm pretty sure these functions have a finer resolution than the most high-rez Windows API timers that are publicly available. On fast processors a single counter tick is well under 1 nanosecond (about 0.31 ns at 3.2 GHz), though I'm not 100% sure the overall error stays that small :)
They are accurate enough for my purposes, but do note that in debug builds the compiler may fill the 40 bytes of locals (composed of 'current', 'constant', 'lower', 'upper', 'timelow', 'timehigh') with a marker value such as 0xCC, which will eat some cycles, as will the math performed at the bottom of every Get_AbsoluteTime call.
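One common way to account for that fixed cost, offered here as a sketch and not as part of the functions above, is to time an empty region several times and subtract the smallest result from later measurements. The names counter_fn, min_read_overhead and fake_now are hypothetical. fake_now is only a stand-in counter so that the sketch runs anywhere; on x86 you would pass a function that wraps RDTSC.

```c
#include <stdint.h>

/* Any function that returns a monotonically increasing tick count. */
typedef uint64_t (*counter_fn)(void);

/* Smallest back-to-back delta seen over `trials` attempts: an
   estimate of what one start/stop pair costs by itself. */
uint64_t min_read_overhead(counter_fn now, int trials)
{
    uint64_t best = UINT64_MAX;
    int i;
    for (i = 0; i < trials; i++) {
        uint64_t a = now();
        uint64_t b = now();
        if (b - a < best)
            best = b - a;
    }
    return best;
}

/* Stand-in counter for demonstration: advances 7 ticks per read. */
uint64_t fake_now(void)
{
    static uint64_t ticks = 0;
    ticks += 7;
    return ticks;
}
```

With the stand-in counter, min_read_overhead(fake_now, 100) reports 7 ticks. Taking the minimum instead of the average is a design choice: interrupts and cache misses only ever add time, so the smallest sample is the closest to the true fixed cost.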
For really pristine accuracy you would be best off framing whatever it is you want to profile directly in RDTSC "inlines". I would make use of the extended x64 registers to hold the first reading for the later subtraction instead of messing around with slower memory access.
Something like the following. This is mainly the concept, by the way, because technically VC2010 doesn't allow you to emit x64 assembly via the __asm keyword :( but I think it will give you the conceptual road to travel:
typedef unsigned long long ulonglong;
ulonglong Cycles;
__asm
{
push rax
push rdx
rdtsc
mov r9, rdx //64-bit move: a 64-bit and a 32-bit register can't be mixed in one mov
shl r9, 32
and rax, 0xFFFFFFFF
or r9, rax
pop rdx
pop rax
}
//...Perform stuff to profile here
__asm
{
push rax
push rdx
rdtsc
mov r10, rdx //64-bit move: a 64-bit and a 32-bit register can't be mixed in one mov
shl r10, 32
and rax, 0xFFFFFFFF
or r10, rax
sub r10, r9
mov qword ptr [Cycles], r10
pop rdx
pop rax
}
printf ("The code took %llu cycles to execute\n", Cycles);
With that code, I think the final number of elapsed cycles will be in r10, a 64-bit register, and in Cycles, a 64-bit unsigned integer, with just a handful of cycles of error caused by the bit shifting and stack operations, provided that the code being profiled doesn't shred r9 and r10 hehe. Under the Windows x64 calling convention r8 to r11 are volatile, so any function call inside the profiled code may overwrite them; r12 to r15 are preserved by callees, but then you have to save and restore them yourself.
Also, the "and rax, 0xFFFFFFFF" is extraneous: in 64-bit mode RDTSC clears the upper 32 bits of both RAX and RDX. I included that AND operation just in case :) but it can simply be dropped, and a 64-bit AND sign-extends its 32-bit immediate anyway, so it would not mask anything.
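Since __asm isn't available for x64 in VC2010, the same idea can be expressed with the __rdtsc compiler intrinsic, which MSVC declares in <intrin.h> and GCC/Clang declare in <x86intrin.h>. This is a sketch under those assumptions, for x86 targets only; the wrapper names read_tsc and elapsed_cycles are made up for illustration.

```c
#include <stdint.h>
#if defined(_MSC_VER)
#include <intrin.h>      /* MSVC: declares __rdtsc */
#else
#include <x86intrin.h>   /* GCC/Clang on x86: declares __rdtsc */
#endif

/* Read the full 64-bit time-stamp counter; the compiler takes care
   of combining EDX:EAX, so no manual shifting is needed. */
uint64_t read_tsc(void)
{
    return __rdtsc();
}

/* Unsigned subtraction stays correct even if the counter wraps. */
uint64_t elapsed_cycles(uint64_t start, uint64_t end)
{
    return end - start;
}
```

You would take one read_tsc() before and one after the code to profile, then print elapsed_cycles(start, end) with a 64-bit format such as %llu. The compiler decides where the two readings live, so there is no need to reserve r9 and r10 by hand.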