Tags: c++, arrays, templates, tokenize, boost-spirit-qi

Is it possible to use "boost::spirit::qi::as_string" in a loop? If so, can someone help me find the solution?


So far I am splitting my string on two delimiters, but I would like to extend this so that the number of delimiters is variable. Right now, I have this function:

void dac_sim::dac_ifs::dac_sim_subcmd_if::parse_cmd(std::string command, std::array<std::string, 2> delimiters)
{
  std::string str = command;
  std::vector< std::string > vec;

  auto it = str.begin(), end = str.end();
  bool res = boost::spirit::qi::parse(it, end,
    boost::spirit::qi::as_string[ *(boost::spirit::qi::char_ - delimiters[0] - delimiters[1]) ] % (boost::spirit::qi::lit(delimiters[0]) | boost::spirit::qi::lit(delimiters[1])),
    vec);

  std::cout << "Parsed:";
  for (auto const& s : vec)
    std::cout << " \"" << s << "\"";
  std::cout << std::endl;
}

But now I want something more generic, via template for the array size, like this:

template <size_t N>
void dac_sim::dac_ifs::dac_sim_subcmd_if::parse_cmd(std::string command, std::array<std::string, N> delimiters)

In this case, how can I proceed?


Solution

  • Fold Expressions

    Can you use c++17? I'd use fold-expressions:

    auto parse_cmd(std::string_view str, auto const&... delim) {
        namespace qi = boost::spirit::qi;
        std::vector<std::string> vec;
    
        qi::parse(str.begin(), str.end(),
                  qi::as_string[*(qi::char_ - ... - delim)] % (qi::lit(delim) | ...) //
                      > qi::eoi,
                  vec);
    
        return vec;
    }
    

    Test it Live On Coliru

    for (auto input :
         {
             "",
             "|",
             "|,",
             "|,||",
             "foo||bar,qux,stux;net||more||||to,come",
         }) //
    {
        fmt::print("{:<15} -> {}\n", fmt::format("'{}'", input), parse_cmd(input, "||", ","));
    }
    

    Prints

    ''              -> [""]
    '|'             -> ["|"]
    '|,'            -> ["|", ""]
    '|,||'          -> ["|", "", ""]
    'foo||bar,qux,stux;net||more||||to,come' -> ["foo", "bar", "qux", "stux;net", "more", "", "to", "come"]
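As a reading aid for the two fold-expressions in the parser above, here is a minimal sketch of how each form expands, using plain integers instead of Spirit parsers. The names subtract_all and or_all are made up for this illustration and are not part of any library:

```cpp
// Binary left fold with an initial value, the same shape as
// (qi::char_ - ... - delim): expands to ((init - a) - b) - c.
template <typename... Ts>
constexpr int subtract_all(int init, Ts... xs) {
    return (init - ... - xs);
}

// Unary right fold, the same shape as (qi::lit(delim) | ...):
// expands to a | (b | c).
template <typename... Ts>
constexpr unsigned or_all(Ts... xs) {
    return (xs | ...);
}

// ((100 - 1) - 2) - 3
static_assert(subtract_all(100, 1, 2, 3) == 94, "left fold over -");
// 1 | (2 | 4)
static_assert(or_all(1u, 2u, 4u) == 7u, "right fold over |");
```

One consequence worth knowing: a unary fold over | is ill-formed for an empty pack, so the variadic parse_cmd should be expected not to compile when called without any delimiter.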
    

    But You Need Arrays?

    You can always use the index-sequence trick to transform into a parameter pack:

    template <size_t N>
    auto parse_cmd(std::string_view str, std::array<std::string, N> const& delims) {
        return [&]<size_t... I>(std::index_sequence<I...>) {
            return do_parse_cmd(str, delims[I]...);
        }(std::make_index_sequence<N>{});
    }
    

    Here do_parse_cmd is the variadic function shown earlier (there named parse_cmd), renamed so it doesn't clash with this overload. Let's demo with ";" added as a third delimiter: Live On Coliru

    std::array<std::string, 3> delimiters{"||", ",", ";"};
    
    for (auto input :
         {
             "",
             "|",
             "|,",
             "|,||",
             "foo||bar,qux,stux;net||more||||to,come",
         }) //
    {
        fmt::print("{:<15} -> {}\n", fmt::format("'{}'", input), parse_cmd(input, delimiters));
    }
    

    Prints

    ''              -> [""]
    '|'             -> ["|"]
    '|,'            -> ["|", ""]
    '|,||'          -> ["|", "", ""]
    'foo||bar,qux,stux;net||more||||to,come' -> ["foo", "bar", "qux", "stux", "net", "more", "", "to", "come"]
    

    Note how stux;net is correctly split now.

    Problems

    • versions
    • semantic problems
    • flexibility

    Versions

    For one, the above requires c++17 for the fold-expressions, and the demos also liberally use c++20 features to make it all easy to demonstrate. If you don't have that, even the c++17 version will become a lot more tedious.
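To give an idea of the extra work: the templated lambda in the array overload is a c++20 feature. A minimal sketch of the same index-sequence expansion with a helper function template, which should compile as c++14, could look like this. Note that do_parse_cmd here is a hypothetical stand-in that only records the delimiters it receives (it is not the Spirit parser), and parse_cmd_impl is a name invented for the sketch:

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the variadic parser: it only records the
// delimiters it was called with, so the pack expansion can be observed.
template <typename... Delims>
std::vector<std::string> do_parse_cmd(std::string const& /*str*/, Delims const&... delims) {
    return {std::string(delims)...};
}

// Helper that receives the indices 0..N-1 as a pack and expands the array
// into separate arguments.
template <std::size_t N, std::size_t... I>
std::vector<std::string> parse_cmd_impl(std::string const& str,
                                        std::array<std::string, N> const& delims,
                                        std::index_sequence<I...>) {
    return do_parse_cmd(str, delims[I]...);
}

// Entry point with the same shape as the array overload shown earlier.
template <std::size_t N>
std::vector<std::string> parse_cmd(std::string const& str,
                                   std::array<std::string, N> const& delims) {
    return parse_cmd_impl(str, delims, std::make_index_sequence<N>{});
}
```

Calling parse_cmd("x", delimiters) with std::array<std::string, 3> delimiters{"||", ",", ";"} forwards the three strings as three separate arguments. The fold-expressions inside the real parser would still need c++17.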

    Semantic problems

    There's an issue when the caller passes delimiters in a sub-optimal order. E.g., {":", ":|:"} won't work, but {":|:", ":"} will. That's because the patterns overlap and alternatives are tried in order, so ":" matches before ":|:" is ever tried. You would want to be smarter.
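The effect can be reproduced without Spirit. Below is a minimal standard-library sketch that mimics ordered choice: at each position it tries the delimiters in the order given and takes the first that matches. split_first_match is a made-up name, and the function is only a stand-in for the parser above, not how Qi is implemented:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Splits `str`, trying the delimiters in the order given and taking the
// first one that matches at the current position (ordered choice).
std::vector<std::string> split_first_match(std::string const& str,
                                           std::vector<std::string> const& delims) {
    std::vector<std::string> tokens(1); // always at least one (possibly empty) token
    std::size_t pos = 0;
    while (pos < str.size()) {
        bool matched = false;
        for (auto const& d : delims) {
            if (!d.empty() && str.compare(pos, d.size(), d) == 0) {
                tokens.emplace_back(); // a delimiter ends the current token
                pos += d.size();
                matched = true;
                break; // first match wins; later alternatives are never tried
            }
        }
        if (!matched)
            tokens.back() += str[pos++]; // ordinary character
    }
    return tokens;
}
```

With the input "a:|:b", the delimiter list {":", ":|:"} yields the tokens "a", "|", "b", whereas {":|:", ":"} yields "a", "b".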

    Flexibility

    You might want to be able to have full-blown parser expression capability instead of fixed string literals. Let me postpone this for later.

    Qi Symbols

    To support c++11 and solve the semantic issue, let's use qi::symbols:

    using tokens = std::vector<std::string>;
    
    template <size_t N> tokens
    parse_cmd(std::string const& str, std::array<std::string, N> const& delims) {
        namespace qi = boost::spirit::qi;
    
        qi::symbols<char> delim;
        for (auto& d : delims)
            delim += d;
    
        tokens vec;
        parse(str.begin(), str.end(), qi::as_string[*(qi::char_ - delim)] % delim > qi::eoi, vec);
        return vec;
    }
    

    This internally builds a trie, so the order in which delimiters are passed doesn't matter: delim always matches the longest delimiter available at the current position.
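To make the longest-match behaviour concrete, here is a minimal standard-library sketch. It is only an illustration: it uses a linear scan over the delimiters instead of a trie, and split_longest_match is a made-up name, not part of Spirit:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Splits `str` on whichever delimiter is the longest match at the current
// position, so the order of `delims` does not influence the result.
std::vector<std::string> split_longest_match(std::string const& str,
                                             std::vector<std::string> const& delims) {
    std::vector<std::string> tokens(1); // always at least one (possibly empty) token
    std::size_t pos = 0;
    while (pos < str.size()) {
        std::size_t best = 0; // length of the longest delimiter matching at pos
        for (auto const& d : delims)
            if (d.size() > best && str.compare(pos, d.size(), d) == 0)
                best = d.size();
        if (best != 0) {
            tokens.emplace_back(); // a delimiter ends the current token
            pos += best;
        } else {
            tokens.back() += str[pos++]; // ordinary character
        }
    }
    return tokens;
}
```

For the input "a:|:b" both {":", ":|:"} and {":|:", ":"} now give the tokens "a", "b".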

    With the same test: Live On Coliru (c++11)

    ''              -> [""]
    '|'             -> ["|"]
    '|,'            -> ["|", ""]
    '|,||'          -> ["|", "", ""]
    'foo||bar,qux,stux;net||more||||to,come' -> ["foo", "bar", "qux", "stux", "net", "more", "", "to", "come"]
    

    Future Proofing

    To be completely flexible and compose the parser from any parser expression, you would have to thread the needle in Qi, and accept considerable compile times.

    Suffice it to say, I won't recommend it. However, using X3¹ none of this is hard, and you could easily achieve it.

    Identical X3 version

    Live On Coliru. 'Nuff said

    Generalize (Computer, Enhance!)

    Basically replacing std::string with auto in the fold-expression variant:

    auto parse_cmd(std::string const& str, auto... delims) {
        tokens vec;
        parse(str.begin(), str.end(),
              *(x3::char_ - ... - x3::as_parser(delims)) //
                      % (x3::as_parser(delims) | ...)    //
                  > x3::eoi,
              vec);
        return vec;
    }
    

    Now you can do funky stuff, like: Live On Coliru

    static constexpr auto input = "foo (false) bar (   true ) qux (4.8e-9) <!-- any comment --> quz";
    fmt::print("input: '{}'\n", input);
    
    auto test = [](auto name, auto... p) {
        fmt::print("{:>5}: {}\n", name, parse_cmd(input, p...));
    };
    
    constexpr auto d = "(" >> x3::double_ >> ")";
    constexpr auto b = x3::skip(x3::blank)["(" >> x3::bool_ >> ")"];
    constexpr auto x = "<!--" >> *(x3::char_ - "-->") >> "-->";
    
    test("d", d);
    test("b", b);
    test("x", x);
    test("x|b|d", x, b, d);
    

    Printing

    input: 'foo (false) bar (   true ) qux (4.8e-9) <!-- any comment --> quz'
        d: ["foo (false) bar (   true ) qux ", " <!-- any comment --> quz"]
        b: ["foo", " bar", " qux (4.8e-9) <!-- any comment --> quz"]
        x: ["foo (false) bar (   true ) qux (4.8e-9) ", " quz"]
    x|b|d: ["foo", " bar", " qux ", " ", " quz"]
    

    Summary/TL;DR

    Combining parsers in X3 is a joy, and crazy powerful. It will typically still be faster to compile than the Qi parsers.

    Note that at no point in this answer did I question why you are reinventing tokenization using a (checks notes) parser generator. Perhaps you should tell me what you're actually building or parsing, and I could give you some real advice on how to use Spirit for great success :)


    ¹ X3 requires c++14, and will require c++17 in the future