I have two vectors v1 and v2 of type std::vector<std::string>. Both vectors hold unique values and should compare equal if they contain the same values, regardless of the order in which the values appear.
I assume two sets of type std::unordered_set would have been a better choice, but I have to take it as it is, so two vectors.
Nevertheless, I thought that for the order-insensitive comparison I would just use operator== from std::unordered_set by copying the vectors into two std::unordered_set instances. Very much like this:
bool oi_compare1(std::vector<std::string> const&v1,
std::vector<std::string> const&v2)
{
std::unordered_set<std::string> tmp1(v1.begin(),v1.end());
std::unordered_set<std::string> tmp2(v2.begin(),v2.end());
return tmp1 == tmp2;
}
While profiling I noticed this function consuming a lot of time, so I checked the documentation and saw the O(n*n) complexity there. I am confused; I was expecting O(n*log(n)), like e.g. for the following naive solution I came up with:
bool oi_compare2(std::vector<std::string> const&v1,
std::vector<std::string> const&v2)
{
if(v1.size() != v2.size())
return false;
auto tmp = v2;
size_t const size = tmp.size();
for(size_t i = 0; i < size; ++i)
{
bool flag = false;
for(size_t j = i; j < size; ++j)
if(v1[i] == tmp[j]){
flag = true;
std::swap(tmp[i],tmp[j]);
break;
}
if(!flag)
return false;
}
return true;
}
Why the O(n*n) complexity for std::unordered_set, and is there a built-in function I can use for order-insensitive comparison?
EDIT---- BENCHMARK
#include <chrono>
#include <iostream>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>
bool oi_compare1(std::vector<std::string> const&v1,
std::vector<std::string> const&v2)
{
std::unordered_set<std::string> tmp1(v1.begin(),v1.end());
std::unordered_set<std::string> tmp2(v2.begin(),v2.end());
return tmp1 == tmp2;
}
bool oi_compare2(std::vector<std::string> const&v1,
std::vector<std::string> const&v2)
{
if(v1.size() != v2.size())
return false;
auto tmp = v2;
size_t const size = tmp.size();
for(size_t i = 0; i < size; ++i)
{
bool flag = false;
for(size_t j = i; j < size; ++j)
if(v1[i] == tmp[j]){
flag = true;
std::swap(tmp[i],tmp[j]);
break;
}
if(!flag)
return false;
}
return true;
}
int main()
{
std::vector<std::string> s1{"1","2","3"};
std::vector<std::string> s2{"1","3","2"};
std::cout << std::boolalpha;
for(size_t i = 0; i < 15; ++i)
{
auto tmp1 = s1;
for(auto &iter : tmp1)
iter = std::to_string(i)+iter;
s1.insert(s1.end(),tmp1.begin(),tmp1.end());
s2.insert(s2.end(),tmp1.begin(),tmp1.end());
}
std::cout << "size1 " << s1.size() << std::endl;
std::cout << "size2 " << s2.size() << std::endl;
for(auto && c : {oi_compare1,oi_compare2})
{
auto start = std::chrono::steady_clock::now();
bool flag = true;
for(size_t i = 0; i < 10; ++i)
flag = flag && c(s1,s2);
std::cout << "ms=" << std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - start).count() << " flag=" << flag << std::endl;
}
return 0;
}
gives
size1 98304
size2 98304
ms=844 flag=true
ms=31 flag=true
--> naive approach way faster.
For all the O(N*N) complexity experts here...
Let me go through this naive approach. I have two loops there. The outer loop runs from i=0 to size, which is N. The inner loop runs from j=i (!) to N. In plain words, I call the inner loop N times. But the complexity of the inner loop is log(n) due to the starting index j = i. If you still don't believe me, calculate the complexity from the benchmarks and you will see...
EDIT2--- LIVE ON WANDBOX https://wandbox.org/permlink/v26oxnR2GVDb9M6y
I'm sorry to tell you, your benchmark of operator== is faulty.
oi_compare1 accepts 2 vectors and needs to build up 2 complete unordered_set instances, to then call operator== and destroy the complete bunch again.
oi_compare2 also accepts 2 vectors, and immediately uses them for the comparison on size. It only copies 1 instance (v2 to tmp), which is much more performant for a vector.
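To see how much of the measured time goes into building the sets rather than into operator== itself, the two phases can be timed separately. The following is only a sketch: Timing, timed_compare and example_usage are hypothetical names that do not appear in the question, and the time spent destroying the sets is not measured.

```cpp
#include <cassert>
#include <chrono>
#include <iostream>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical result type: construction time, comparison time, and the result.
struct Timing {
    std::chrono::microseconds build;
    std::chrono::microseconds compare;
    bool equal;
};

// Hypothetical helper: same work as oi_compare1, but each phase is timed on its own.
Timing timed_compare(std::vector<std::string> const& v1,
                     std::vector<std::string> const& v2)
{
    using clock = std::chrono::steady_clock;
    using std::chrono::duration_cast;
    using std::chrono::microseconds;

    auto const t0 = clock::now();
    std::unordered_set<std::string> s1(v1.begin(), v1.end());
    std::unordered_set<std::string> s2(v2.begin(), v2.end());
    auto const t1 = clock::now();
    bool const equal = (s1 == s2);  // only this line exercises operator==
    auto const t2 = clock::now();

    return {duration_cast<microseconds>(t1 - t0),
            duration_cast<microseconds>(t2 - t1),
            equal};
}

// Usage sketch: report both phases for two small vectors.
void example_usage()
{
    std::vector<std::string> const a{"1", "2", "3"};
    std::vector<std::string> const b{"1", "3", "2"};
    Timing const t = timed_compare(a, b);
    assert(t.equal);
    std::cout << "build=" << t.build.count() << "us"
              << " compare=" << t.compare.count() << "us\n";
}
```

Running timed_compare on the s1 and s2 vectors from the benchmark should show how the total splits between the two phases on a given machine.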
Looking at the documentation: https://en.cppreference.com/w/cpp/container/unordered_set/operator_cmp we can see the expected complexity:
Proportional to N calls to operator== on value_type, calls to the predicate returned by key_eq, and calls to the hasher returned by hash_function, in the average case, proportional to N^2 in the worst case where N is the size of the container.
edit
There is a simple algorithm: you can loop over one unordered_set and do a simple lookup in the other one. Without hash collisions, it will find each element in its own internal bucket and compare it for equality, as the hash alone isn't sufficient.
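A minimal sketch of that lookup idea, adapted to the two vectors from the question: only one set is built, and the other vector is walked directly. The name oi_compare_lookup is hypothetical, and the function relies on the question's stated assumption that both vectors hold unique values.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper. Assumes both vectors contain unique values:
// with equal sizes, "every element of v2 is in v1" then implies equality.
bool oi_compare_lookup(std::vector<std::string> const& v1,
                       std::vector<std::string> const& v2)
{
    if (v1.size() != v2.size())
        return false;

    // Build a single lookup table instead of two sets.
    std::unordered_set<std::string> const lookup(v1.begin(), v1.end());

    // One hash lookup per element: O(N) on average, O(N*N) in the worst case.
    for (auto const& s : v2)
        if (lookup.find(s) == lookup.end())
            return false;
    return true;
}

// Usage sketch: order is ignored, contents are not.
void example_lookup_usage()
{
    assert(oi_compare_lookup({"1", "2", "3"}, {"3", "1", "2"}));
    assert(!oi_compare_lookup({"1", "2", "3"}, {"1", "2", "4"}));
}
```

Compared with oi_compare1, this variant constructs and destroys one set instead of two.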
Assuming you don't have hash collisions, each element of the unordered_set has a stable order in which it is stored. One could loop over the internal buckets and compare the elements 2-by-2 (1st of the one with the 1st of the second, 2nd of the one with the 2nd of the second ...). This nicely gives O(N). This doesn't work when you have different sizes of the buckets you store the values in, or when the assignment of buckets uses a different calculation to deal with collisions.
Assume you are unlucky and every element results in the same hash (known as hash flooding). You end up with a list of elements without order. To compare, you have to check for each element whether it exists in the other one, causing O(N*N).
This last one is easily reproducible if you rig your hash to always return the same number. Build the one set in the reverse order of the other one.
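A sketch of such a rigged setup is shown here. ConstantHash, FloodedSet, make_flooded and example_flooded_usage are hypothetical names introduced for illustration. How badly the comparison degrades depends on the standard library implementation, so no timings are claimed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>

// Deliberately bad hasher: every key gets the same hash value,
// so all elements collide.
struct ConstantHash {
    std::size_t operator()(std::string const&) const noexcept { return 0; }
};

using FloodedSet = std::unordered_set<std::string, ConstantHash>;

// Inserts the keys "0" .. "n-1", either ascending or descending,
// so two sets with equal contents can be built in opposite orders.
FloodedSet make_flooded(std::size_t n, bool reversed)
{
    FloodedSet s;
    for (std::size_t i = 0; i < n; ++i)
        s.insert(std::to_string(reversed ? n - 1 - i : i));
    return s;
}

// Usage sketch: the sets still compare equal, but every lookup
// has to search among colliding elements.
void example_flooded_usage()
{
    FloodedSet const a = make_flooded(500, false);
    FloodedSet const b = make_flooded(500, true);
    assert(a == b);
}
```

Timing a == b for growing n, with and without ConstantHash, is one way to observe the difference between the average and the worst case. Note that building the flooded sets is itself slow, because each insert has to check the colliding elements for duplicates.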