
Boost spirit parsing string with leading and trailing whitespace


I am still new to Boost spirit.

I am trying to parse a string that may contain leading, trailing, and intermediate whitespace. I want to do the following with the string:

  1. Remove any trailing and leading whitespace
  2. Limit the in-between word spaces to one whitespace

For example

"(  my   test1  ) (my  test2)"

gets parsed as two terms -

"my test1" 
"my test2"

I have used the following logic

using namespace boost::spirit::qi;
struct Parser : grammar<Iterator, attribType(), space_type>
{
   public:
     Parser() : Parser::base_type(group)
     {
         group  %= '(' >> (group | names) >> ')';
         names %= no_skip[alnum][_val=_1];
     }

  private:
    typedef boost::spirit::qi::rule<Iterator, attribType(), space_type> Rule;
    Rule group;
    Rule names;
};

This preserves the spaces between words. Unfortunately, it also keeps leading and trailing whitespace, as well as runs of multiple intermediate spaces. I would like to find a better approach.

I did see references to using a custom skipper with boost::spirit::qi::skip online, but I haven't come across a useful example for spaces. Does anyone else have experience with it?


Solution

  • I'd suggest doing the trimming/normalization after (not during) parsing.

    That said, you could hack it like this:

    name   %= lexeme [ +alnum ];
    names  %= +(name >> (&lit(')') | attr(' ')));
    group  %= '(' >> (group | names) >> ')';
    

    See it Live On Coliru

    Output:

    Parse success
    Term: 'my test1'
    Term: 'my test2'
    

    I introduced the name rule only for readability. Note that (&lit(')') | attr(' ')) is a fancy way of saying:

    If the next character matches ')' do nothing, otherwise, append ' ' to the synthesized attribute

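    Full code (note that this listing appends the space with a semantic action, eps [ phx::push_back(_val, ' ') ], where the short snippet above uses attr(' ')):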

    #define BOOST_SPIRIT_DEBUG
    #include <boost/spirit/include/qi.hpp>
    #include <boost/spirit/include/phoenix.hpp>
    #include <iostream>
    #include <string>
    #include <vector>
    
    namespace qi = boost::spirit::qi;
    namespace phx = boost::phoenix;
    
    using Iterator = std::string::const_iterator;
    
    using attribType = std::string;
    
    struct Parser : qi::grammar<Iterator, attribType(), qi::space_type>
    {
       public:
         Parser() : Parser::base_type(group)
         {
             using namespace qi;
    
             name   %= lexeme [ +alnum ];
             names  %= +(name >> (&lit(')') | eps [ phx::push_back(_val, ' ') ]));
             group  %= '(' >> (group | names) >> ')';
    
             BOOST_SPIRIT_DEBUG_NODES((name)(names)(group))
         }
    
      private:
        typedef boost::spirit::qi::rule<Iterator, attribType(), qi::space_type> Rule;
        Rule group, names, name;
    };
    
    
    int main()
    {
        std::string const input = "(  my   test1  ) (my  test2)";
    
        auto f(input.begin()), l(input.end());
    
        Parser p;
    
        std::vector<attribType> data;
        bool ok = qi::phrase_parse(f, l, *p, qi::space, data);
    
        if (ok)
        {
            std::cout << "Parse success\n";
            for(auto const& term : data)
                std::cout << "Term: '" << term << "'\n";
        }
        else
        {
            std::cout << "Parse failed\n";
        }
    
        if (f!=l)
            std::cout << "Remaining unparsed input: '" << std::string(f,l) << "'\n";
    }
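
For the post-parse normalization suggested at the top of this answer, a minimal sketch could look like the following. It assumes the grammar has already captured the raw text between the parentheses into a std::string (for example "  my   test1  "). The helper name normalize_spaces is hypothetical and not part of Boost; it uses only the standard library.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not part of Boost.Spirit): drops leading and trailing
// whitespace and collapses every interior run of whitespace to one space.
std::string normalize_spaces(std::string const& in)
{
    std::string out;
    bool pending_space = false; // whitespace seen since the last word character

    for (unsigned char ch : in) {
        if (std::isspace(ch)) {
            pending_space = true;
        } else {
            // Emit a single separator only between words, never before the
            // first word; trailing whitespace is never flushed.
            if (pending_space && !out.empty())
                out += ' ';
            pending_space = false;
            out += static_cast<char>(ch);
        }
    }
    return out;
}
```

Applied to each parsed term after phrase_parse returns, normalize_spaces("  my   test1  ") yields "my test1". This keeps the grammar free of whitespace bookkeeping, at the cost of one extra pass over each term.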