I am timing some simple numerical operations to decide how (i.e., using which tools from which library) to implement a computationally intensive simulation. The code below computes the sum of two inner products of vectors using (1) MTL4 version 4.0.9486 as is (i.e., without BLAS), (2) std::vector and std::inner_product, and (3) std::valarray. I picked a toy example of this specific form because it seemed like ideal ground for MTL4's expression templates.
To narrow everything down to a single question: is the following comparison fair, or does it (unintentionally) put any one of the three approaches at a disadvantage? I was a bit surprised that (2) is faster than (1). Whether the overall simulation will be faster is a different story, of course.
If anyone has suggestions for more thorough tests that might reveal the strengths or weaknesses of each approach, I am eager to try them out.
Pardon the macros in the code; they are just std::cout << statements and calls to <chrono> utilities.
Thanks in advance.
C++ code:
#include <iostream>
#include <valarray>
#include <vector>
#include <numeric>   // std::inner_product, std::accumulate
#include <chrono>
#include <cmath>     // M_PI (non-standard, but widely available)
#include <boost/numeric/mtl/mtl.hpp>

// The macros mentioned above: <chrono> timing and std::cout statements.
#define NOW std::chrono::steady_clock::now();
#define DURATIONm std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
#define PRINTV(vec) { std::cout << #vec << " = "; \
                      for (auto t : vec) std::cout << t << ' '; \
                      std::cout << '\n'; }
#define PRINTME(name, value) std::cout << #name << ": " << value << '\n';

int main(int argc, const char * argv[])
{
    /* DOT PRODUCTS */
    constexpr int trials{15};
    std::vector<double> mtl_times(trials, 0.0), stl_times(trials, 0.0), valarray_times(trials, 0.0);
    constexpr size_t sz{10000000};
    double val = M_PI;
    mtl::dense_vector<double> m(sz, val), n(sz, val), p(sz, val), q(sz, val);
    std::vector<double> y(sz, val), z(sz, val), v(sz, val), w(sz, val);
    std::valarray<double> b(val, sz), c(val, sz), d(val, sz), e(val, sz);
    double x{0.0}, k{0.0}, aa{0.0};
    auto t0 = NOW
    auto t1 = t0;
    for (int i = 0; i < trials; ++i) {
        // MTL4 vectors
        t0 = NOW
        k = dot(m, n) + dot(p, q);
        t1 = NOW
        mtl_times[i] = DURATIONm
        // STL vectors
        t0 = NOW
        x = std::inner_product(y.begin(), y.end(), z.begin(), 0.0)
          + std::inner_product(v.begin(), v.end(), w.begin(), 0.0);
        t1 = NOW
        stl_times[i] = DURATIONm
        // valarrays
        t0 = NOW
        aa = (b*c + d*e).sum();
        t1 = NOW
        valarray_times[i] = DURATIONm
    }
    std::cout << "MTL4: average time for dot product = "
              << std::accumulate(mtl_times.begin(), mtl_times.end(), 0.0)/mtl_times.size()
              << " msec\n";
    PRINTV(mtl_times)
    PRINTME(result, k)
    std::cout << '\n';
    std::cout << "STL vectors + std::inner_product: average time for dot product = "
              << std::accumulate(stl_times.begin(), stl_times.end(), 0.0)/stl_times.size()
              << " msec\n";
    PRINTV(stl_times)
    PRINTME(result, x)
    std::cout << '\n';
    std::cout << "valarrays: average time for dot product = "
              << std::accumulate(valarray_times.begin(), valarray_times.end(), 0.0)/valarray_times.size()
              << " msec\n";
    PRINTV(valarray_times)
    PRINTME(result, aa)
    return 0;
}
C++ Output:
MTL4: average time for dot product = 180.333 msec
mtl_times = 177 175 174 174 175 178 176 185 184 174 175 179 175 216 188
result: 1.97392e+08
STL vectors + std::inner_product: average time for dot product = 58.6 msec
stl_times = 56 55 56 57 57 56 57 56 57 55 55 58 56 58 90
result: 1.97392e+08
valarrays: average time for dot product = 64.4 msec
valarray_times = 63 64 63 64 65 63 63 63 64 63 63 63 64 64 77
result: 1.97392e+08
For the record, MATLAB performs well:
MATLAB code:
trials = 15;
times_ms = zeros(1, trials);
sz = 1e7;
val = pi;
x(sz) = val;
x(1:end-1) = val;
y(sz) = val;
y(1:end-1) = val;
v(sz) = val;
v(1:end-1) = val;
w(sz) = val;
w(1:end-1) = val;
z = 0;
for i = 1:trials
tic
z = x*y' + v*w';
times_ms(i) = toc*1e3;
end
avg_time = sum(times_ms)/length(times_ms)
times_ms
z
MATLAB output:
avg_time = 56.0687 msec
times_ms = 56.8919 57.2052 55.3179 55.5126 55.7660 55.3982 55.1044 55.4809 57.7229 56.1902 57.3888 56.5263 55.2830 55.4926 55.7501
z = 1.9739e+08
This is not surprising, since MATLAB's built-in operations are optimised; however, there are other obstacles associated with using MATLAB in the simulation.
Computing dot products over and over will probably be memory-bound. It'd be better to compare something like matrix multiplication if you're trying to get a rough sense of the speed differences you can expect. Dot products are also straightforward enough that you can just inspect the assembly code to see what's going on; I'd encourage you to do this.
Your comparison is slightly unfair to valarrays: you build two element-wise product arrays, add them into a third array, and then sum that result. It's probably better to compute the two sums separately and add them. (I don't know how to avoid the extra scan of the whole array using the valarray interface.)