c++visual-c++c++11 parallel-processing directxmath

How to use std::accumulate to neatly sum values in a vector pointed by separately defined indices (replacing loops)

I was wondering if there's a neater (or better yet, more efficient), method of summing values of a vector/(asymmetric) matrix (a matrix having structure like symmetry, could of course be exploited in looping, but not that pertinent to my question) pointed by a collection of indices. Basically this code could be used to calculate, say, a cost of a route through a 2D matrix. I'm looking for a way to utilize CPU, not GPU.

Here's some relevant code, the one I'm more interested is the first case. I was thinking it's possible to use std::accumulate with a lambda to capture the indices vector, but then I got wondering, if there's already a neater way, perhaps with some other operator. Not a "real problem" as looping is quite clear for my tastes too, but in hunt for the super-neat or more efficient on-liner...

template<typename out_type>
out_type sum(std::vector<float> const& matrix, std::vector<int> const& indices)
{
    out_type cost = 0;
    for(decltype(indices.size()) i = 0; i < indices.size() - 1; ++i) 
    {
        const int index = indices.size() * indices[i] + indices[i + 1];
        cost += matrix[index];
    }

    const int index = indices.size() * indices[indices.size() - 1] + indices[0];
    cost += matrix[index];

    return cost;
}

template<typename out_type>
out_type sum(std::vector<std::vector<float>> const& matrix, std::vector<int> const& indices)
{
    out_type cost = 0;
    for(decltype(indices.size()) i = 0; i < indices.size() - 1; i++) 
    {
        cost += matrix[indices[i]][indices[i + 1]];
    }
    cost += matrix[indices[indices.size() - 1]][indices[0]];

    return cost;
}

Oh, and PPL/TBB are fair game too.

Edit

As an afterthought and as commented to John, would there be a place to employ std::common_type in the calculation as the input and output types may differ? This is a bit of hand-waving and more like learning techniques and libraries. A form of code kata, if you will.

Edit 2

Now, there's one option to make the loops faster, explained in blog writing How to process a STL vector using SSE code by a blogger theowl84. The code uses __m128 directly, but I wonder if there's something in DirectXMath library too.

Edit 3

Now, after writing some concrete code, I found std::accumulate wouldn't get me far. Or at least I couldn't find a way to do the [indices[i + 1] part in matrix[indices[i]][indices[i + 1]]; in a neat way, as std::accumulate itself gives access to only the current value and the sum. In that light, it looks like novelocrat's approach would be the most fruitful one.

DeadMG proposed using parallel_reduce with associativity caveats, further commented by novelocrat. I didn't go about seeing if I could use parallel_reduce, as the interface looked somewhat cumbersome for quick trying. Other than that, even though my code executes serially, it would suffer from the same floating some issues as the parallel reduction version. Though the parallel version would/could be (much) more unpredictable with than serial version, I think.

This goes somewhat tangential, but it may be of interest to some stumbling here, and to those of whom have read this far, may be (very) interested on article Wandering Precision in The NAG blog, which details some intricanciens even introduced by hardware instruction re-ordering! Then there are some ruminations about this very issue in distributed setting in #AltDevBlogADay Synchronous RTS Engines and a Tale of Desyncs. Also, ACCU (the general mailing list is excellent, by the way, and it's free to join) features several articles (e.g. this) on floating point accuracy. A tangential to tangential, I found Fernando Cacciola's Robustness issues in geometric computing to be a good article to read, originally from ACCU mailing list.

And then then the std::common_type. I couldn't find usage for that. If I had two different types as parameters, then the return value could/should be decided by std::common_type. Perhaps more pertinent is std::is_convertible with static_assert to make sure the desired result type is convertible from the argument types (with a clean error message). Other than that, I can only make up a check that the return value/intermediate calculation value accurracy is sufficient to represent the result of summation without overflows and things like that, but I haven't come across a standard facility for that.

That about that, I think, ladies and gentlemen. I enjoyed myself, I hope those reading this got something out of this too.

Solution

You could produce an iterator that takes matrix and indices and yields the appropriate values.

class route_iterator
{
  vector<vector<float>> const& matrix;
  vector<int> const& indices;
  int i;

public:
  route_iterator(vector<vector<float>> const& matrix_, vector<int> const& indices_,
                 int begin = 0)
  : matrix(matrix_), indices(indices_), i(begin)
  { }
  float operator*() {
    return matrix[indices[i]][indices[(i + 1) % indices.size()]];
  }
  route_iterator& operator++() {
    ++i;
    return *this;
  }
};

Then your accumulate runs from route_iterator(matrix, indices) to route_iterator(matrix, indices, indices.size()).

Admittedly, though, this sequentializes without a smart compiler turning it into something parallel. What you really want are parallel map and fold (accumulate) operations.