c++ benchmarking timing

Fast C++ sign function


In my code I'm doing a sign check on a double numerous times in a loop and that loop is typically run several million times over the duration of the execution.

My sign check is a pretty rudimentary calculation using fabs() so I figured there must be other ways of doing it that are probably quicker since "dividing is slow". I came across a template function and copysign() and created a simple program to run a speed comparison. I've tested the three possible solutions with the code below.

// C++ program to measure the execution time of functions
#include <chrono>
#include <cmath>    // fabs, copysign
#include <iostream>
#include <string>   // std::string used by printResult

using namespace std; 
using namespace std::chrono; 

template<typename Clock>
void printResult(const std::string& name, std::chrono::time_point<Clock> start, std::chrono::time_point<Clock> stop, const int iterations)
{
    // Total duration in nanoseconds, then averaged per iteration
    // (integer division, so sub-nanosecond averages truncate to 0).
    auto my_duration = duration_cast<nanoseconds>(stop - start);
    my_duration /= iterations;

    cout << "Time taken by "<< name <<" function: " << my_duration.count() << " ns avg. for " << iterations << " iterations." << endl << endl; 
}


template <typename T> int sgn(T val) 
{
    return (T(0) < val) - (val < T(0));
}


int main() {

    // ***************************************************************** //
    int numiters = 100000000;
    double vel = -0.6574;
    double result = 0;
    
    // Get starting timepoint 
    auto start_1 = high_resolution_clock::now(); 
    for(int x = 0; x < numiters; x++) 
    {

        result = (vel/fabs(vel)) * 12.1;

    }

    // Get ending timepoint 
    auto stop_1 = high_resolution_clock::now(); 
    cout << "Result is: " << result << endl;
    printResult("fabs", start_1, stop_1, numiters);

    // Get starting timepoint 
    result = 0;
    auto start_2 = high_resolution_clock::now(); 
    for(int x = 0; x < numiters; x++) 
    {

        result = sgn(vel) * 12.1;

    }

    // Get ending timepoint 
    auto stop_2 = high_resolution_clock::now(); 
    cout << "Result is: " << result << endl;
    printResult("sgn", start_2, stop_2, numiters);


    // Get starting timepoint 
    result = 0;
    auto start_10 = high_resolution_clock::now(); 
    for(int x = 0; x < numiters; x++) 
    {

        result = copysign(12.1, vel);

    }

    // Get ending timepoint 
    auto stop_10 = high_resolution_clock::now(); 
    cout << "Result is: " << result << endl;
    printResult("copysign", start_10, stop_10, numiters);

    cout << endl;


}

When I run the program I'm a little surprised to find that the fabs() solution and the copysign solution are almost identical in execution time. Also, when I run multiple times, I see that the results can be quite variable.

Is my timing correct? And is there a better way of doing what I'm doing than the three examples I've tested?

Update

I've implemented the tests on quick-bench.com where the compiler setting can be specified and all 3 results seem to be almost identical there. I think I may have got something wrong: https://quick-bench.com/q/PJiAmoC2NQIJyuvbdz5ZHUALu2M


Solution

  • As I warned you, your test does not measure anything!

    From your quick-bench.com link, click the Godbolt icon to see the disassembly.

    Note that all of your versions compile to the same assembly code:

            movabs  rax, -4600370724363619533 # compile-time evaluated result, moved outside the measurement loop
    .LBB0_3:                                  # =>This Inner Loop Header: Depth=1
            mov     qword ptr [rsp + 8], rax
            add     rbx, -1                   # measurement loop counter
            jne     .LBB0_3
    

    So basically the compiler was able to remove the test code completely, since it noticed that everything can be evaluated at compile time!

    So you have to feed the test some value that the compiler will not be able to determine at compile time.

    Here is my attempt to fix your test, and its assembly to see what has been optimized. I give no warranty that this measures the right thing; you have to verify that yourself. Measuring such a small and fast piece of code is really hard. In fact, anything that executes in so few CPU cycles can't be measured precisely and reliably by software.