Tags: c++, boost, boost-spirit-qi

Specify a charset without interpreting ranges


I'm puzzled about how to define a character set in a rule when a minus sign should be just a literal minus character and not a range of characters between two endpoints.

For example, when you write a rule to percent-encode a string of characters, you would normally write

*(bk::char_("a-zA-Z0-9-_.~") | '%' << bk::right_align(2, 0)[bk::upper[bk::hex]]);

This is meant to say "lowercase letters, capital letters, digits, minus sign, underscore, dot and tilde", but the last minus sign creates a range between 9 and underscore (or something like that), so you have to put the minus at the end: bk::char_("a-zA-Z0-9_.~-").

That solves the immediate problem, but what should one do when the input is dynamic, such as user input, and a minus sign just means the minus character?

How do I prevent Spirit from assigning a special meaning to any of the possible characters?
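To make the range problem concrete, here is a small sketch in plain standard C++ (no Boost). It is a simplified model of how such a definition string appears to be expanded, not Spirit's actual code: the function name expand_charset is hypothetical, and the rules it applies (x-y is a range, the end of one range can start the next, a trailing - is literal) are an assumption based on the behaviour described above.

```cpp
#include <bitset>
#include <cstddef>
#include <string>

// Hypothetical helper: a simplified model of charset-definition expansion.
// Assumed rules: "x-y" denotes a range, the end of a range may start the
// next one, and a '-' at the very end of the string is a literal minus.
std::bitset<256> expand_charset(const std::string& def)
{
    std::bitset<256> set;
    std::size_t i = 0;
    while (i < def.size()) {
        const unsigned char lo = static_cast<unsigned char>(def[i]);
        if (i + 1 < def.size() && def[i + 1] == '-') {
            if (i + 2 >= def.size()) {
                // Trailing '-': both the current character and '-' are literal.
                set.set(lo);
                set.set('-');
                break;
            }
            const unsigned char hi = static_cast<unsigned char>(def[i + 2]);
            for (unsigned c = lo; c <= hi; ++c)
                set.set(c);
            i += 2; // the range end becomes the next current character
        } else {
            set.set(lo);
            ++i;
        }
    }
    return set;
}
```

Under this model, expand_charset("a-zA-Z0-9-_.~") does not contain '-' but does contain everything from '9' to '_' (for example '=' and '@'), while expand_charset("a-zA-Z0-9_.~-") contains '-' and none of those extras.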

EDIT001: Here is a more concrete example, taken from @sehe's answer:

void spirit_direct(std::vector<std::string>& result, const std::string& input, char const* delimiter)
{
    result.clear();
    using namespace bsq;
    if(!parse(input.begin(), input.end(), raw[*(char_ - char_(delimiter))] % char_(delimiter), result))
        result.push_back(input);
}

If you want to ensure the minus is treated as a minus and not as a range, you would alter the code as follows (according to @sehe's proposal below):

void spirit_direct(std::vector<std::string>& result, const std::string& input, char const* delimiter)
{
    result.clear();
    bsq::symbols<char, bsq::unused_type> sym_;
    std::string separators = delimiter;
    for(auto ch : separators)
    {
        sym_.add(std::string(1, ch));
    }
    using namespace bsq;
    if(!parse(input.begin(), input.end(), raw[*(char_ - sym_)] % sym_, result))
        result.push_back(input);
}

This looks quite elegant. For a static constant rule, I guess I can escape characters with '\'. Square brackets are apparently one of those "special" characters that need to be escaped. Why? What is the meaning of []? Are there any additional characters to escape?


Solution

  • Simple.

    You devise and specify the supported patterns that the user can supply with their meanings.

    Next, either

    • you write the code that transforms it into a character set (e.g. expand all ranges, if they are supported in user input, and always move the - to be the first character), or

    • you do not use a character set at all:

      • why not use char_ [ _pass = my_match_predicate(_1) ]
      • why not just make an alternation of literal characters? lit('a') | 'b' | '-' | '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
      • why not use qi::symbols<char, char> (or even qi::symbols<char, qi::unused_type> sym_; with raw [ sym_ ] or similar)

        Update: The qi::symbols<> approach is surprisingly fast: Live On Coliru. In a recent optimization job it disappointed, though: see the answer to "Binary String to Hex c++" (under "Spirit (Trie)").

    In general, I don't know what you're trying to achieve, but Spirit is not well-suited for generating rules on the fly. See some of my existing answers on this site.
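
The first option above (transforming the user's characters into a character set) can be sketched in plain standard C++. This is a minimal sketch under stated assumptions, not a definitive implementation: the helper name make_literal_charset is hypothetical, it assumes that - is the only character with special meaning in a charset definition and that a leading - is taken literally, and it ignores embedded NUL characters.

```cpp
#include <string>

// Hypothetical helper: turns a list of literal characters into a charset
// definition in which no character can be read as part of a range.
// Assumption: '-' is the only special character, and it is literal when
// it is the first character of the definition.
std::string make_literal_charset(const std::string& chars)
{
    std::string out;
    bool has_minus = false;
    for (char c : chars) {
        if (c == '-') {
            has_minus = true;  // remember it, emit it once at the front
            continue;
        }
        if (out.find(c) == std::string::npos)
            out.push_back(c);  // drop duplicates, keep first-seen order
    }
    if (has_minus)
        out.insert(out.begin(), '-');
    return out;
}
```

With this, a user-supplied delimiter list such as "9-_" becomes "-9_", which could then be handed to char_ in place of the raw delimiter string. The symbols-based version in the question avoids the issue altogether, since it never parses a definition string.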