Why copy_if works slower then copy
I'm currently working on my OpenGL graphics engine. And I was trying to figure out the best way to pass big amount of objects to GPU for instanced drawing. The biggest problem for me is that some objects can be dead, so I created a small test.
Here is a simple struct that I was testing (in real app it would be position + colors, etc.)
struct foo
{
bool is_active = false;
float value = 0.0f;
};
After this I created these containers:
// All data
std::vector<foo> data_vector;
// Data that is only active
std::vector<foo> active_vector;
using distance_t = vector<foo>::iterator::difference_type;
// List of segments, so that if we have 10 elements where
// only the 5th is not active it is going to look like that
// { {0,5}, {6, 10} }
std::list<pair<distance_t, distance_t>> active_segments;
Reserved space for 1,000,000 elements in vectors. Filled data_vector with all true values. Filled the list as well in order to ignore allocation time. And tested the speed of these 3 copy functions with high_resolution_clock
// First method
// For all true values *active_segments* has only one element with
// {0, 1000000}
for_each(active_segments.begin(), active_segments.end(),
[&active_vector, data_vector](auto current)
{
copy(data_vector.begin() + current.first,
data_vector.begin() + current.second,
std::back_inserter(active_vector));
});
// Second method
copy_if(data_vector.begin(), data_vector.end(),
std::back_inserter(active_vector),
[](const foo ¤t)
{
return current.is_active;
});
// Third method
copy(data_vector.begin(), data_vector.end(),
std::back_inserter(active_vector));
Obviously copy was the fastest one with 18024 microseconds. But I was surprised that copy_if is faster(27777 microseconds), then the first method(33278 microseconds).
I don't understand why this is happening. I wanted to have some extra memory allocation but increased copy speed, but in result I have method that is slower even in the best conditions.
It appears to me that you have a combination of (at least) two factors leading to the problem.
The first is a real problem: in your lambda, you're capturing data_vector
by value instead of by reference, so you're copying the whole input array, then copy data from that copy to the result.
The second is mostly specific to benchmarking: cache warming. If I fix the lambda so its captures by reference, your method 1 still runs significantly slower than the other two methods. But if I add a simple cache-warming loop ahead of it:
for (int i = 0; i < size; i++)
active_vector.push_back(data_vector[i]);
...then I can run all three after that, and they all run at close enough to the same speed that I can no longer be certain that one is faster than another.
On the other hand, I believe this also indicates that the whole exercise is kind of pointless--although copy_if
should theoretically be a tiny bit slower than copy
(on a per-element basis), I can't find any significant difference between the two. I suspect in most cases, memory bandwidth is the limiting factor, and the extra processing time to figure out whether to copy something is simply lost in the noise. In fact, sometimes, the second version (using copy_if
) is coming out the fastest, and the third (using copy
) the slowest:
method 1: 3,295us
method 2: 3,178us
method 3: 3,839us
Just for what it's worth, here's the code as I ran it:
#include <algorithm>
#include <chrono>
#include <iostream>
#include <vector>
#include <list>
#include <utility>
struct foo
{
bool is_active = true;
float value = 0.0f;
};
int main() {
const int size = 1'000'000;
std::cout.imbue(std::locale(""));
// All data
std::vector<foo> data_vector(size);
// Data that is only active
std::vector<foo> active_vector;
using distance_t = std::vector<foo>::iterator::difference_type;
// List of segments, so that if we have 10 elements where
// only the 5th is not active it is going to look like that
// { {0,5}, {6, 10} }
std::vector<std::pair<distance_t, distance_t>> active_segments;
using namespace std::chrono;
// Warm the cache:
for (int i = 0; i < size; i++)
active_vector.push_back(data_vector[i]);
{
active_segments.emplace_back(0, size);
active_vector.clear();
active_vector.reserve(size);
auto begin = high_resolution_clock::now();
for_each(active_segments.begin(), active_segments.end(),
[&active_vector, &data_vector](auto current)
{
copy(data_vector.begin() + current.first,
data_vector.begin() + current.second,
std::back_inserter(active_vector));
});
auto end = high_resolution_clock::now();
std::cout << "method 1: " << duration_cast<microseconds>(end - begin).count() << "us\n";
}
{
active_vector.clear();
active_vector.reserve(size);
auto begin = high_resolution_clock::now();
// Second method
copy_if(data_vector.begin(), data_vector.end(),
std::back_inserter(active_vector),
[](const foo ¤t)
{
return current.is_active;
});
auto end = high_resolution_clock::now();
std::cout << "method 2: " << duration_cast<microseconds>(end - begin).count() << "us\n";
}
{
active_vector.clear();
active_vector.reserve(size);
auto begin = high_resolution_clock::now();
copy(data_vector.begin(), data_vector.end(),
std::back_inserter(active_vector));
auto end = high_resolution_clock::now();
std::cout << "method 3: " << duration_cast<microseconds>(end - begin).count() << "us\n";
}
}
There is one more point that should probably be considered: do you still need the elements in data_vector
that are no longer active? If you no longer need them, you can use std::remove_if
to move all the active elements to the beginning of the collection, and then erase from there to the end.
auto e = std::remove_if(data_vector.begin(), data_vector.end(),
[](auto const &e) {return e.is_active; });
data_vector.erase(e, data_vector.end());
A quick test with a 50% chance of each element being marked active or inactive shows this running around twice the speed of copying the active elements.