c++ c++11 matching

Why is std::cmatch slower than std::smatch here?


I generate a long random string first:

const int length = 100000;
std::uniform_int_distribution<int> distribution(0, 2);
std::default_random_engine engine{1}; // set 1 as seed

// Just for test usage, not optimal.
std::string a;
for (int i = 0; i < length; i++) // random abc
    a.push_back(static_cast<char>('a' + distribution(engine)));
std::regex r{ "abc" };

Then I use std::smatch and std::cmatch separately and benchmark them:

std::smatch m;
std::string a0 = a;
int result = 0; // to disable optimization.

while (std::regex_search(a0, m, r))
{
    a0 = m.suffix();
    result += static_cast<int>(a0[0]);
}
return result;

With std::cmatch:

std::cmatch m;
const char* currBegin = a.c_str();
int result = 0;

while (std::regex_search(currBegin, m, r))
{
    // For practical use in the future.
    std::string_view v(m[0].first, m[0].second - m[0].first);
    currBegin = m.suffix().first;
    result += static_cast<int>(*currBegin);
}
return result;

The cmatch version is about five times slower than the smatch version. Why?

Note that I measured with Catch2's BENCHMARK macro, using MSVC 19.29 in release mode with the C++ standard set to C++20.


Solution

  • After reading the source code of std::regex_search, I found that passing a const char* makes it run a strlen-like scan first to locate the end of the string. I got the expected result after changing the following line:

    while (std::regex_search(currBegin, m, r))
    

    to

    while (std::regex_search(currBegin, currEnd, m, r))
    

    where currEnd = currBegin + a.size(). With an explicit end pointer, the strlen-like scan is skipped and I get a 50% speedup in the std::cmatch version. Without it, every call rescans all the remaining characters just to find the terminator, which drags down the whole loop.
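
For reference, the fixed loop can be sketched as a self-contained function. This is only an illustration of the change described above, not the original benchmark: the name count_matches is hypothetical, it counts matches instead of summing characters, and it assumes the pattern never matches an empty string (true for "abc"), since an empty match would otherwise never advance.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper (not from the original post): counts non-overlapping
// matches of r in text using std::cmatch and an explicit end pointer.
std::size_t count_matches(const std::string& text, const std::regex& r)
{
    const char* currBegin = text.data();
    // Computed once up front, so no per-call length scan is needed.
    const char* const currEnd = currBegin + text.size();

    std::cmatch m;
    std::size_t count = 0;

    // The (first, last, match, regex) overload searches [currBegin, currEnd)
    // directly instead of first scanning for the null terminator.
    while (std::regex_search(currBegin, currEnd, m, r))
    {
        if (m[0].first == m[0].second)
            break; // guard: an empty match would not advance currBegin
        ++count;
        currBegin = m.suffix().first; // continue right after this match
    }
    return count;
}
```

With the string a and the regex r from the question, this would be called as count_matches(a, r). Only the start pointer moves between iterations; the end pointer stays fixed, which is what avoids the repeated scan.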