I am timing some simple numerical operations to decide how (i.e., using which tools from which library) to implement a computationally intensive simulation. The code below computes the sum of two inner products of vectors using (1) MTL4 version 4.0.9486 as is (i.e., without BLAS), (2) std::vector and std::inner_product, and (3) std::valarray. I picked a toy example of this specific form because it seemed like ideal ground for MTL4's expression templates.
To narrow everything down to a single question: is the following comparison fair, or does it (unintentionally) put any one of the three approaches at a disadvantage? I was a bit surprised that (2) is faster than (1). Whether the overall simulation will be faster is a different story, of course.
If anyone has suggestions for more thorough tests that might reveal the strengths or weaknesses of each approach, I am eager to try them out.
Pardon the macros in the code; they are just std::cout << statements and calls to <chrono> utilities.
Thanks in advance.
C++ code:
#include <iostream>
#include <valarray>
#include <vector>
#include <numeric>   // std::inner_product, std::accumulate
#include <chrono>
#include <cmath>     // M_PI (non-standard, but widely available)
#include <boost/numeric/mtl/mtl.hpp>

// The macros mentioned above: <chrono> timing and std::cout statements.
#define NOW std::chrono::steady_clock::now();
#define DURATIONm std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
#define PRINTV(vec) { std::cout << #vec << " = "; \
                      for (auto t : vec) std::cout << t << ' '; \
                      std::cout << '\n'; }
#define PRINTME(name, value) std::cout << #name << ": " << value << '\n';

int main(int argc, const char * argv[])
{
    /* DOT PRODUCTS */
    constexpr int trials{15};
    std::vector<double> mtl_times(trials, 0.0), stl_times(trials, 0.0), valarray_times(trials, 0.0);
    constexpr size_t sz{10000000};
    double val = M_PI;
    mtl::dense_vector<double> m(sz, val), n(sz, val), p(sz, val), q(sz, val);
    std::vector<double> y(sz, val), z(sz, val), v(sz, val), w(sz, val);
    std::valarray<double> b(val, sz), c(val, sz), d(val, sz), e(val, sz);
    double x{0.0}, k{0.0}, aa{0.0};
    auto t0 = NOW
    auto t1 = t0;
    for (int i = 0; i < trials; ++i) {
        // MTL4 vectors
        t0 = NOW
        k = dot(m, n) + dot(p, q);
        t1 = NOW
        mtl_times[i] = DURATIONm
        // STL vectors
        t0 = NOW
        x = std::inner_product(y.begin(), y.end(), z.begin(), 0.0)
          + std::inner_product(v.begin(), v.end(), w.begin(), 0.0);
        t1 = NOW
        stl_times[i] = DURATIONm
        // valarrays
        t0 = NOW
        aa = (b*c + d*e).sum();
        t1 = NOW
        valarray_times[i] = DURATIONm
    }
    std::cout << "MTL4: average time for dot product = "
              << std::accumulate(mtl_times.begin(), mtl_times.end(), 0.0)/mtl_times.size()
              << " msec\n";
    PRINTV(mtl_times)
    PRINTME(result, k)
    std::cout << '\n';
    std::cout << "STL vectors + std::inner_product: average time for dot product = "
              << std::accumulate(stl_times.begin(), stl_times.end(), 0.0)/stl_times.size()
              << " msec\n";
    PRINTV(stl_times)
    PRINTME(result, x)
    std::cout << '\n';
    std::cout << "valarrays: average time for dot product = "
              << std::accumulate(valarray_times.begin(), valarray_times.end(), 0.0)/valarray_times.size()
              << " msec\n";
    PRINTV(valarray_times)
    PRINTME(result, aa)
    return 0;
}
C++ Output:
MTL4: average time for dot product = 180.333 msec
mtl_times = 177 175 174 174 175 178 176 185 184 174 175 179 175 216 188
result: 1.97392e+08
STL vectors + std::inner_product: average time for dot product = 58.6 msec
stl_times = 56 55 56 57 57 56 57 56 57 55 55 58 56 58 90
result: 1.97392e+08
valarrays: average time for dot product = 64.4 msec
valarray_times = 63 64 63 64 65 63 63 63 64 63 63 63 64 64 77
result: 1.97392e+08
For the record, MATLAB performs well:
MATLAB code:
trials = 15;
times_ms = zeros(1, trials);
sz = 1e7;
val = pi;
x(sz) = val;
x(1:end-1) = val;
y(sz) = val;
y(1:end-1) = val;
v(sz) = val;
v(1:end-1) = val;
w(sz) = val;
w(1:end-1) = val;
z = 0;
for i = 1:trials
tic
z = x*y' + v*w';
times_ms(i) = toc*1e3;
end
avg_time = sum(times_ms)/length(times_ms)
times_ms
z
MATLAB output:
avg_time = 56.0687 msec
times_ms = 56.8919 57.2052 55.3179 55.5126 55.7660 55.3982 55.1044 55.4809 57.7229 56.1902 57.3888 56.5263 55.2830 55.4926 55.7501
z = 1.9739e+08
This is not surprising, since MATLAB's built-in operations are optimised; however, there are other obstacles associated with using MATLAB in the simulation.
Computing dot products over and over will probably be memory-bound. It'd be better to compare something like matrix multiplication if you're trying to get a rough sense of the speed differences you can expect. Dot products are also straightforward enough that you can just inspect the assembly code to see what's going on; I'd encourage you to do this.
Your comparison is slightly unfair to valarrays: you build two element-wise product arrays, add them into a third array, and then sum that result. It's probably better to compute the two sums separately and add them. (I don't know how to avoid the extra scan of the whole array using the valarray interface.)