
Assembly: Compute Execution Time of Instructions


How do you compute the execution time of instructions? Is it just a matter of checking what the chip manufacturers say about how many clock cycles an operation takes to complete? Is there anything else I should know about this? It feels like I'm missing something...


Solution

  • The RDTSC instruction is extremely accurate as far as I know.

    I think if you are seeking the exact cycle counts, then in the case of short boostable sections you may run into the issues of simultaneity that Mysticial mentioned...

    But if ultra-precision isn't a requirement... that is to say, if you can live with a result that is off by, say, 9 to 80 cycles in certain scenarios... then I'm pretty sure you can still get very accurate results with RDTSC, especially considering that 9 to 80 cycles at 3.2 GHz is only about 3 to 25 nanoseconds :)

    The numbers 9 and 80 are somewhat arbitrary (and your CPU may not run at 3.2 GHz either), since I don't know exactly what the error is... but I'm pretty sure it's in that ballpark :)

    Here's the RDTSC excerpt of a timer function I use:

    //High-Rez Setup
    __asm
    {
        push        eax
        push        edx
        rdtsc
        mov         [AbsoluteLow],eax
        mov         [AbsoluteHigh],edx
        pop         edx
        pop         eax
    }
    

    Actually, I'll go ahead and post the whole thing. This code assumes that the type "double" is a 64-bit floating-point number and that "ulong" is a 32-bit unsigned integer, which might not hold for every compiler / architecture:

    double              AbsoluteTime;
    double              AbsoluteResolution;
    ulong               AbsoluteLow;
    ulong               AbsoluteHigh;
    
    
    
    void Get_AbsoluteTime (double *time)
    {
        //Variables
        double  current, constant;
        double  lower, upper;
        ulong   timelow, timehigh;
    
        //Use the Intel RDTSC
        __asm
        {
            push    eax
            push    edx
            rdtsc
            sub     eax, [AbsoluteLow]
            sbb     edx, [AbsoluteHigh]
            mov     [timelow], eax
            mov     [timehigh], edx
            pop     edx
            pop     eax
        }
    
        //Convert two 32bit registers to a 64-bit floating point
        //Multiplying by 4294967296 is similar to left-shifting by 32 bits
        constant     = 4294967296.0;
        lower        = (double) timelow;
        upper        = (double) timehigh;
        upper       *= constant;
        current      = lower + upper;
        current     /= AbsoluteResolution;
        current     += AbsoluteTime;
        *time        = current;
    }
    
    
    
    void Set_AbsoluteTime (double time, double scale)
    {
        //Variables
        double  invScale;
    
        //Setup
        AbsoluteTime = time;
    
        //High-Rez Setup
        __asm
        {
            push    eax
            push    edx
            rdtsc
            mov     [AbsoluteLow],eax
            mov     [AbsoluteHigh],edx
            pop     edx
            pop     eax
        }
    
        //Fetch MHZ
        if (1)
        {
            //Local Variables
            int      nv;
            ulong    mhz;
            char     keyname[2048];
    
            //Default assumption of 3.2ghz if registry functions fail
            mhz = 3200;
    
            //Registry Key
            sprintf (keyname, "HARDWARE\\DESCRIPTION\\System\\CentralProcessor\\0");
            nv = Reg_Get_ValueDW (keyname, "~MHz", (ulong *)&mhz);
    
            //Transform into cycles per second (done in floating point,
            //since mhz * 1000000 overflows a 32-bit ulong above ~4294 MHz)
    
            //Calculate Speed Stuff
            AbsoluteResolution = (double) mhz * 1000000.0;
            invScale  = 1.0;
            invScale /= scale;
            AbsoluteResolution *= invScale;
        }
    }
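
    The conversion step in Get_AbsoluteTime (joining the two 32-bit halves that RDTSC leaves in EDX:EAX) can also be sketched in portable C. The helper names below are hypothetical and not part of the timer code above; this is only an illustration of the same arithmetic, assuming a compiler that provides <stdint.h>:

```c
#include <stdint.h>

/* Hypothetical helper: join the EDX:EAX halves returned by RDTSC
   into a single 64-bit tick count. */
static uint64_t combine_tsc_halves(uint32_t low, uint32_t high)
{
    return ((uint64_t)high << 32) | (uint64_t)low;
}

/* Same value as a double, mirroring "lower + upper * 4294967296.0"
   from Get_AbsoluteTime. Exact only while the count stays below 2^53. */
static double combine_tsc_halves_as_double(uint32_t low, uint32_t high)
{
    return (double)low + (double)high * 4294967296.0;
}
```

    For example, combine_tsc_halves(0, 1) gives 4294967296, which is the same as multiplying the high half by the constant used above. A double holds integers exactly only up to 2^53, which at 3.2 GHz is roughly a month of ticks since the Set call.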
    

    You need to call Set_AbsoluteTime once before using the Get function... without that initial call to Set, Get returns erroneous results... but once that one-time call is made you are good to go...

    here's an example:

    void Function_to_Profile (void)
    {
        //Variables
        double   t1, t2, TimeElapsed;
    
        //Profile operations
        Get_AbsoluteTime (&t1);
        //...do stuff here...
        Get_AbsoluteTime (&t2);
    
        //Calculate Elapsed Time
        TimeElapsed = (t2 - t1);
    
        //Feedback
        printf ("This function took %.11f seconds to run\n", TimeElapsed);
    }
    
    int main (void)
    {
        Set_AbsoluteTime (0.000, 1.000);
        Function_to_Profile();
        return 0;
    }
    

    If for some reason you wanted time measurements to flow backwards at half speed (maybe handy for game programming), the initial call would be: Set_AbsoluteTime (0.000, -0.500);

    The first parameter to Set is the base time that gets added to all results.
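
    To make the effect of the base time and the scale concrete, here is a small sketch of the same arithmetic as a pure function. The name cycles_to_time and its parameters are hypothetical; it only restates what Set_AbsoluteTime and Get_AbsoluteTime compute together, assuming the clock speed is given in MHz:

```c
#include <stdint.h>

/* Hypothetical helper: convert elapsed cycles to a time value using the
   same arithmetic as Set_AbsoluteTime / Get_AbsoluteTime.
   mhz   - clock speed in MHz (e.g. 3200.0)
   base  - base time added to every result
   scale - time scale; must be nonzero (negative runs time backwards) */
static double cycles_to_time(uint64_t cycles, double mhz, double base, double scale)
{
    double resolution = (mhz * 1000000.0) / scale;  /* cycles per scaled second */
    return (double)cycles / resolution + base;
}
```

    With 3.2 billion elapsed cycles at 3200 MHz, a scale of 1.0 gives 1.0 second and a scale of -0.5 gives -0.5, i.e. time running backwards at half speed.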

    I'm pretty sure these functions are more accurate than the highest-resolution timers in the public Windows API... I think on fast processors one tick is shorter than 1 nanosecond (about 0.31 ns at 3.2 GHz), but I'm not 100% sure on that :)

    They are accurate enough for my purposes, but do note that the standard initialization of the 40 bytes of locals ('current', 'constant', 'lower', 'upper', 'timelow', 'timehigh'), which some compilers fill with 0xCC or 0xCD in debug builds, will eat some cycles... as will the math performed at the bottom of every Get_AbsoluteTime call...

    so for really pristine accuracy you would be best framing whatever it is you want to profile in RDTSC "inlines"... I would make use of the extended x64 registers to store the answer for later subtraction operations instead of messing around with slower memory access...

    like for example something like this... this is mainly the concept by the way, because technically VC2010 doesn't allow you to emit x64-Assembly via the __asm keyword :( ...but I think it will give you the conceptual road to travel:

    typedef unsigned long long ulonglong;
    ulonglong Cycles;
    
    __asm
    {
        push rax
        push rdx
        rdtsc
        mov r9, rdx     ;64-bit copy; rdtsc leaves the upper 32 bits of rdx zeroed
        shl r9, 32
        and rax, 0xFFFFFFFF
        or  r9, rax
        pop rdx
        pop rax
    }
    
    ...Perform stuff to profile here
    
    __asm
    {
        push rax
        push rdx
        rdtsc
        mov r10, rdx    ;64-bit copy; rdtsc leaves the upper 32 bits of rdx zeroed
        shl r10, 32
        and rax, 0xFFFFFFFF
        or  r10, rax
        sub r10, r9
        mov qword ptr [Cycles], r10
        pop rdx
        pop rax
    }
    
    printf ("The code took %llu cycles to execute\n", Cycles);
    

    With that code I think the final number of elapsed cycles will be in r10, a 64-bit register... or in Cycles, a 64-bit unsigned integer... with just a handful of cycles of error caused by the bit shifting and stack operations... provided that the code being profiled doesn't clobber r9 and r10. Under the Windows x64 calling convention r9 and r10 are volatile, so a callee-saved register such as r12 to r15 is more likely to survive function calls in the profiled code.

    Also, the "and rax, 0xFFFFFFFF" is extraneous: in 64-bit mode RDTSC clears the upper 32 bits of both RAX and RDX, so the AND can be dropped.
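
    Since inline x64 assembly isn't available in that compiler, the same idea can be sketched with the __rdtsc() compiler intrinsic instead. This is a minimal sketch, assuming an x86 or x86-64 target and a compiler that provides the intrinsic (<intrin.h> on MSVC, <x86intrin.h> on GCC and Clang). The names measure_cycles and sample_work are hypothetical placeholders:

```c
#include <stdint.h>
#if defined(_MSC_VER)
#include <intrin.h>      /* __rdtsc on MSVC */
#else
#include <x86intrin.h>   /* __rdtsc on GCC and Clang */
#endif

/* Hypothetical helper: run fn once and return the elapsed TSC ticks.
   Assumes an x86 / x86-64 CPU; RDTSC is not a serializing instruction,
   so very short sections can still be off by some cycles. */
static uint64_t measure_cycles(void (*fn)(void))
{
    uint64_t start = __rdtsc();
    fn();
    uint64_t end = __rdtsc();
    return end - start;
}

/* Placeholder workload standing in for "stuff to profile". */
static void sample_work(void)
{
    volatile unsigned sum = 0;
    for (unsigned i = 0; i < 1000u; ++i)
        sum += i;
}
```

    A call such as printf ("%llu\n", (unsigned long long) measure_cycles (sample_work)); would then print the tick count. The intrinsic already returns the combined 64-bit value, so no shifting or register juggling is needed. On many modern CPUs the counter ticks at a constant rate rather than at the current core clock, so the result is best read as "TSC ticks".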