Search code examples
boostboost-xpressive

Xpressive : much slower regex search when expression is built from sub expression


Using Boost Xpressive (static expression) , I noticed that pattern searching is much slower when the expression is built from sub regexpression.

Did I miss something ? or is it inherent with the design ? Xpresive docs says https://www.boost.org/doc/libs/1_80_0/doc/html/xpressive/user_s_guide.html#boost_xpressive.user_s_guide.grammars_and_nested_matches.embedding_a_regex_by_value

it is as if the regex were embedded by value; that is, a copy of the nested regex is stored by the enclosing regex. The inner regex is invoked by the outer regex during pattern matching. The inner regex participates fully in the match, back-tracking as needed to make the match succeed.

Consider these 2 ways of defining a regexp matching an uri (probably sub-optimal and not 100%, but the point is not on this topic).

If the expression is defined in one go, execution is around 6x faster than if the same regex is built from 3 sub regex.

Consider this code snippet

#include <iostream>
#include <string>
#include <chrono>
#include <boost/xpressive/xpressive.hpp>

using namespace boost::xpressive;

void bench_regex(const sregex& regex)
{
    std::string positive = "asdas http://www.foo.io/bar asdp https://www.bar.io ";
    std::string negative = "sdaoas dof jdfjo fds dsf http:/www.nonono .sa ";
    const int nb_iterations = 100'000;
    int nmatch = 0;
    smatch what;

    std::chrono::steady_clock::time_point begin0 = std::chrono::steady_clock::now();
 
    for (int i = 0 ; i < nb_iterations; ++i)
    {
        if (regex_search( positive, what,  regex ))
            nmatch++;
        if (regex_search( negative, what,  regex ))
            nmatch++;
    }
    std::chrono::steady_clock::time_point end0 = std::chrono::steady_clock::now();
    std::cout << "nb matchs " << nmatch << std::endl;
    std::cout << "search time " << std::chrono::duration_cast<std::chrono::microseconds>(end0-begin0).count()/1000.0f <<"ms" <<  std::endl << std::endl;
}

int main()
{
    {
        std::cout << "regex in one piece" << std::endl;
        const sregex regex_uri_standalone = alpha >>  *alnum  >> "://" >> + ~(set= ' ','/') >> !( *('/' >>  ~(set=' ')));
        bench_regex(regex_uri_standalone);
    }
    {
        std::cout << "regex built from part" << std::endl;
        const sregex scheme = alpha >> *alnum;
        const sregex hostname = + ~(set= ' ','/');
        const sregex path = !( *('/' >>  ~(set=' ')));

        const sregex regex_uri_built_from_subregex = scheme >> "://" >> hostname >> path;
        bench_regex(regex_uri_built_from_subregex);
    }
}

This is particularly annoying because a main force of Xpressive is the ability to construct complex regexp from simplier one, which can be quickly become a nightmare if using pcre or equivalent. But if it comes with such a performance cost, the benefit looks annihilated.

btw, is the library still maintained ? according to boost changelog, no change since boost 1.55 (11Nov 2013 !) https://www.boost.org/users/history/


Solution

  • No you can't get around the function invocation / type erasure overhead of sregex instance, because the expression template has already been compiled.

    What you can do instead, is use deduced type for the sub-expressions:

    using boost::proto::deep_copy;
    std::cout << "regex built from auto" << std::endl;
    const auto scheme   = deep_copy(xp::alpha >> *xp::alnum);
    const auto hostname = deep_copy(+~(xp::set = ' ', '/'));
    const auto path     = deep_copy(!(*('/' >> ~(xp::set = ' '))));
    

    Be aware that the deep_copy is absolutely necessary to avoid dangling references to temporaries since you're naming the expressions now. The good news is that on my system, the result is slightly faster than before:

    Live On Coliru

    Printing

    regex in one piece nb matchs 100000 search time 180.01ms
    regex built from auto nb matchs 100000 search time 170.172ms

    But Let's Speed It Up

    There's Boost Spirit, which has a very similar parser expression language. I think it's just simply more modern. Let's try and compare!

    Live On Coliru

    #include <chrono>
    #include <iostream>
    #include <string>
    
    double do_xp_test();
    double do_x3_test();
    
    int main() {
        auto xp = do_xp_test();
        auto x3 = do_x3_test();
        std::cout << "x3 took ~" << (x3/xp*100.0) << "% of the xpressive time\n";
    }
    
    auto now = std::chrono::steady_clock::now;
    using namespace std::chrono_literals;
    
    static std::string const positive = "asdas http://www.foo.io/bar asdp https://www.bar.io ";
    static std::string const negative = "sdaoas dof jdfjo fds dsf http:/www.nonono .sa ";
    constexpr int            nb_iterations = 100'000;
    
    #include <boost/xpressive/xpressive.hpp>
    
    namespace xp = boost::xpressive;
    
    double bench_regex(const xp::sregex& regex) {
        unsigned          nmatch        = 0;
        xp::smatch        what;
    
        auto begin0 = now();
    
        for (int i = 0; i < nb_iterations; ++i) {
            if (regex_search(positive, what, regex))
                nmatch++;
            if (regex_search(negative, what, regex))
                nmatch++;
        }
    
        auto elapsed = (now() - begin0) / 1.0ms;
        std::cout << "nb matchs " << nmatch << "\telapsed " << elapsed << "ms\n";
        return elapsed;
    }
    
    double do_xp_test() {
        using boost::proto::deep_copy;
        std::cout << "regex built from auto\t";
        const auto scheme   = deep_copy(xp::alpha >> *xp::alnum);
        const auto hostname = deep_copy(+~(xp::set = ' ', '/'));
        const auto path     = deep_copy(!(*('/' >> ~(xp::set = ' '))));
    
        const xp::sregex regex_uri_built_from_subregex =
            scheme >> "://" >> hostname >> path;
        return bench_regex(regex_uri_built_from_subregex);
    }
    
    #include <boost/spirit/home/x3.hpp>
    namespace x3 = boost::spirit::x3;
    
    double bench_x3(auto parser_expression) {
        auto const search_expr = x3::seek[x3::raw[parser_expression]];
    
        [[maybe_unused]] std::string what;
        auto begin0 = now();
    
        unsigned nmatch = 0;
        for (int i = 0; i < nb_iterations; ++i) {
            if (parse(begin(positive), end(positive), search_expr/*, what*/))
                nmatch++;
            if (parse(begin(negative), end(negative), search_expr/*, what*/))
                nmatch++;
        }
    
        auto elapsed = (now() - begin0) / 1.0ms;
        std::cout << "nb matchs " << nmatch << "\telapsed " << elapsed << "ms\n";
        return elapsed;
    }
    
    double do_x3_test() {
        std::cout << "spirit x3 from subs\t";
        const auto scheme   = x3::alpha >> *x3::alnum;
        const auto hostname = +~x3::char_(" /");
        const auto path     = *('/' >> +~x3::char_(' '));
    
        const auto uri_built_from_subs = scheme >> "://" >> hostname >> path;
        return bench_x3(uri_built_from_subs);
    }
    

    Prints

    regex built from auto   nb matchs 100000    elapsed 156.939ms
    spirit x3 from subs nb matchs 100000    elapsed 93.3622ms
    x3 took ~59.4897% of the xpressive time
    

    Or on my system, enter image description here

    Related: Boost URL

    For actual handling of URLs (not detection) I suggest using the newly accepted Boost URL library.