Search code examples
c++clinuxcpuinstructions

Is there a programmatic way to estimate the time my CPU takes to perform a fp operation?


By "fp operation" I mean "floating point operation". I'm working on a Linux box. Is there a system call that returns this value as a static metric or can you test this with an algorithm in C/C++/some other language?

edit: I should have mentioned that this isn't to test the efficiency of my code. I was asked during an interview how long a theoretical algorithm would take to run. I had to figure out how many FLOPS would be executed and multiply that by the time each operation would take to come up with a rough estimate. I just though it was an interesting question.


Solution

  • What you want to use is Agner Fog's software "Test programs for measuring clock cycles and performance monitoring". It's what he used to measure the latency and throughput of instructions to produce his famous instruction tables. It is well documented and includes device drivers (with instructions on how to install them) that you can use in your own code. This is especially useful because in order to measure certain quantities such as the real CPU frequency you need access to model specific registers (MSR) which user level code does not normally have access to.

    Edit: based on your edit of an interview question to estimate how much time it will take to run for a floating point intensive operation you can use this formula:

    time = efficiency * number_of_floating_point_operations / peak_flops.
    

    The peak flops per core for many processors can be found from this link flops-per-cycle-for-sandy-bridge-and-haswell-sse2-avx-avx2

    Each of these quantities may be difficult to calculate/estimate but the efficiency is the most difficult since it can depend on many things such as:

    1. Is the algorithm compute or memory bound?
    2. How well can the algorithm use SIMD (e.g. SSE/AVX/FMA)?
    3. How well can the algorithm use MIMD (e.g. multiple cores)?
    4. How well does your implementation use the different cache levels?

    To make this more clear let's consider two algorithms: matrix multiplication and the the dot product. It's easy to calculate the number of floating point operations for these two algorithms. The number of floating point operations for matrix multiplication is 2*n*n*n. For the dot product it's 2*n.

    Matrix multiplication if done right it is compute bound can fully benefit from SIMD and MIMD. It's efficiency will start low for small n and plateau for large n. My own implementation gets up to 75%. The Intel MKL gets over 95% (but less than 90% using FMA).

    So a back-of-the-envelop estimate of the time for matrix-multiplication for large n is to assume 100% efficiency giving time = 2*n^3/peak_flops.

    However, for the dot product the efficiency will start high for small n and drop to a plateau for large n. That's because it's memory bound. So for large n the efficiency is determined by how fast memory can be read. For a modern machine that's about 10 GB/s. Since a modern desktop with four cores will have a peak flops over 100 GLOPS and floats are 4 or 8 bytes I would estimate the efficiency for large n close to 1% giving time = 0.01*n/peak_flops. I went ahead and tested this (see the code below). I get about 2.2 GLOPS on my system which has a peak of 236 GFLOPS so that's about 1% of the peak. The bandwidth of my system is about 11 GB/s.

    Most algorithms are memory bound so knowing how fast your system can read memory (DDR3, DDR4,...) is one of the most useful metrics to estimate the time.

    So in general if you know the number of floating point operations of an algorithm and the peak flops of your processor the first thing you should ask is if the algorithm is compute bound or memory bound for large n then for a back-of-the-envelope estimate of the time I would assume the efficiency was 100% for compute bound and for memory bound I would look up the bandwidth to estimate the efficiency.

    This code estimates the data rate and GFLOPS from the dot product.

        #include <time.h>
        #include <stdlib.h>
        #include <string.h>
        #include <stdio.h>
        #include <stdint.h>
    
        float dot(float *x, float *y, int n) {
            float sum = 0;
            for(int i=0; i<n; i++) {
                sum += x[i]*y[i];
            }
            return sum;
        }
    
        int main(){
            const int LEN = 1 << 28;
            float *x = new float[LEN];
            float *y = new float[LEN];
            for(int i=0; i<LEN; i++) { x[i] = 1.0*rand()/RAND_MAX - 0.5; y[i] = 1.0*rand()/RAND_MAX - 0.5;}
    
            uint32_t size = 2*sizeof(float)*LEN;
    
            clock_t time0 = clock();
            float sum = dot(x,y,LEN);
            clock_t time1 = clock();
            double dtime = (double)(time1 - time0) / CLOCKS_PER_SEC;
            double rate = 1.0*size/dtime*1E-9;
            double flops = 2.0*LEN/dtime*1E-9;
            printf("sum %f, dtime %f, rate %f, flops %f\n", sum, dtime, rate,flops);
        }