Parsing delimited list of tokens using Boost Spirit Qi


Using boost::spirit::qi, I'm trying to parse lines consisting of a label followed by a variable number of delimited tokens. I'm calling the grammar with phrase_parse and using the provided blank parser as the skip parser. This preserves newlines, which I need because the label must be the first item on each line.

The simple base case:

label token, token, token

Can be parsed with the grammar:

line = label >> (token % ',') >> eol;

The problem I am facing is that the grammar should accept zero or more tokens and that tokens may be empty. The grammar should accept the following lines:

label
label ,
label , token
label token, , token,

I have not managed to create a grammar that accepts all examples above. Any suggestions on how to solve this?

Edit:

Thanks to sehe for all the input on the problem stated above. Now for the fun part that I forgot to include... The grammar should also accept empty lines and split lines (tokens without a label). When I try to make the label optional, I get an infinite loop matching the empty string.

label

label token
token

Solution

  • You should be able to accept the empty list with

    line = label >> -(token % ',') >> eol;
    

    Note that eol won't match if your skipper skips end-of-line characters too, so don't use qi::space as the skipper here; use e.g. qi::blank instead.

    Also, depending on how token is defined, you may need to change it so that it accepts the "empty" token as well.


    In response to the comment, here is a fully working sample (Live On Coliru):

    #include <boost/spirit/include/qi.hpp>
    #include <iostream>
    #include <string>
    #include <vector>
    
    namespace qi = boost::spirit::qi;
    
    int main()
    {
        using namespace qi;
    
        using It     = std::string::const_iterator;
        using Token  = std::string;
        using Tokens = std::vector<Token>;
    
        rule<It, blank_type> label 
            = lexeme[+~char_(":")] >> ':'
            ;
    
        // a token is any run of characters up to the next ',' or newline (may be empty)
        rule<It, Token(), blank_type> token
            = lexeme[*~char_(",\n")]
            ;
    
        rule<It, Tokens(), blank_type> line
            = label >> -(token % ',') >> eol
            ;
    
        for (std::string const input : {
            "my first label: 123, 234, 345 with spaces\n",
            "1:\n",
            "2: \n",
            "3: ,,,\n",
            "4: ,  \t ,,\n",
            "5: ,  \t , something something,\n",
        })
        {
            std::cout << std::string(40, '=') << "\nparsing: '" << input << "'\n";
    
            Tokens parsed;
            auto f = input.begin(), l = input.end();
            bool ok = phrase_parse(f, l, line, blank, parsed);
    
            if (ok)
            {
                std::cout << "Tokens parsed successfully, number parsed: " << parsed.size() << "\n";
                for (auto token : parsed)
                    std::cout << "token value '" << token << "'\n";
            }
            else
                std::cout << "Parse failed\n";
    
            if (f != l)
                std::cout << "Remaining input: '" << std::string(f, l) << "'\n";
        }
    }
    

    Output:

    ========================================
    parsing: 'my first label: 123, 234, 345 with spaces
    '
    Tokens parsed successfully, number parsed: 3
    token value '123'
    token value '234'
    token value '345 with spaces'
    ========================================
    parsing: '1:
    '
    Tokens parsed successfully, number parsed: 1
    token value ''
    ========================================
    parsing: '2: 
    '
    Tokens parsed successfully, number parsed: 1
    token value ''
    ========================================
    parsing: '3: ,,,
    '
    Tokens parsed successfully, number parsed: 4
    token value ''
    token value ''
    token value ''
    token value ''
    ========================================
    parsing: '4: ,       ,,
    '
    Tokens parsed successfully, number parsed: 4
    token value ''
    token value ''
    token value ''
    token value ''
    ========================================
    parsing: '5: ,       , something something,
    '
    Tokens parsed successfully, number parsed: 4
    token value ''
    token value ''
    token value 'something something'
    token value ''