Boost spirit grammar to skip php comments. Is this working code written with the current recommended boost parser?


I've written a function that strips all comments and a few other elements from PHP code. It's working fine, but, as I do not deeply understand the code, I have some doubts:

  • Am I using the latest technology to parse a grammar in boost? A few years ago I used only Spirit but I didn't use qi.

  • Is this the right approach with spirit?

  • What is the reason for putting the grammar inside a block of code?

      #include <boost/spirit/include/qi.hpp>
    
      namespace qi = boost::spirit::qi;
    
      using namespace std;
    
      string non_comments_php_code(const string &contents)
      {
          string non_comments_code;
          using Iterator = string::const_iterator;
          Iterator begin = contents.cbegin(), end = contents.cend();
    
          using Skipper = qi::rule<Iterator>;
          auto identifier = qi::standard_wide::char_;
          Skipper block_comment, single_line_comment, skipper,
              php_tag, php_comment, php_namespace, php_use;
          {
              using namespace qi;
              single_line_comment = "//" >> *(standard_wide::char_ - eol) >> (eol|eoi);
              block_comment       = ("/*" >> *(block_comment | standard_wide::char_ - "*/")) > ("*/"|eoi);
              php_tag             = lit("<?php") | lit("?>");
              php_comment         = '#' >> *(standard_wide::char_ - eol) >> (eol|eoi);
              php_namespace       = lit("namespace ") >> *(standard_wide::char_ - (eol|';')) >> (eol|';');
              php_use             = lit("use ") >> *(standard_wide::char_ - (eol|';')) >> (eol|';');
              skipper             = space | single_line_comment | block_comment | php_tag | php_namespace | php_use | php_comment;
          }
          bool ok = phrase_parse(begin, end, skipper, skipper);
          if ( begin != end) {
              while( begin != end && *begin != '\n') {
                  non_comments_code += *begin++;
              }
          }
          return non_comments_code;
     }
    

EDIT: The goal of the function is to return any code in the php file that is neither a comment nor a (use|namespace) statement nor the tags <?php .. ?>

I am using templates to autogenerate php code, and once the code is created, I can add custom code. Previous to calling this function I manage to delete all the automatic code that was generated, and then this function tells me if I have added any custom code to the php file.

That is why I say it is working: I don't care about the code itself, I just want to know if there is any custom code at all.

EDIT 2:

Example of input string:

<?php

namespace tests;
use codeception/tests;

/* This unit test tests something */

class Tester {

    /// @group debug
    public function testsFeature(/*AcceptanceTester*/ $I) {
        $I->assertTrue($this->testsAll());
    }
}
?>

And the required output:

classTester{publicfunctiontestsFeature($I){$I->assertTrue($this->testsAll());}}

In fact, the result itself is of no use; I just need to know whether it is empty.

There are other approaches to solve the whole problem, like regenerating the template in a temp file and diffing it to find the added changes, but 1) that would be far more expensive, and 2) I really want to learn to use boost grammar parsers.


Solution

  • Oh I see, the whole thing was a bit inside-out. You are "parsing" the stuff that you want to "skip" and "skipping" the stuff you need "outside the parser".

    It seems a lot more straightforward to have a parser and skipper in their designated roles. Let's create a StripCommentParser:

    using Iterator = std::string::const_iterator;
    struct StripCommentParser : qi::grammar<Iterator, std::string()> {
    

    This declares a std::string attribute, which we will use to collect the desired output. I'd put all the bits together like so:

    struct StripCommentParser : qi::grammar<Iterator, std::string()> {
        StripCommentParser() : StripCommentParser::base_type(start) {
    
            using namespace qi;
            single_line_comment = "//" >> *(qi::char_ - eol) >> (eol | eoi);
            block_comment       = ("/*" >> *(block_comment | qi::char_ - "*/")) > ("*/" | eoi);
            php_tag             = lit("<?php") | lit("?>");
            php_comment         = '#' >> *(qi::char_ - eol) >> (eol | eoi);
            php_namespace       = lit("namespace ") >> *(qi::char_ - (eol | ';')) >> (eol | ';');
            php_use             = lit("use ") >> *(qi::char_ - (eol | ';')) >> (eol | ';');
    
            start = qi::skip(space | single_line_comment | block_comment | php_tag | php_namespace | php_use | php_comment)[*char_];
        }
    
      private:
        qi::rule<Iterator, std::string()> start;
        qi::rule<Iterator> block_comment, single_line_comment, php_tag, php_comment, php_namespace, php_use;
    };
    
    std::string non_comments_php_code(std::string const& contents) {
        std::string non_comments_code;
        parse(begin(contents), end(contents), StripCommentParser{}, non_comments_code);
        return non_comments_code;
    }
    

    Notes:

    • no phrase_parse: the skipper is now part of the grammar, and the caller should definitely not be able to change it

    • rules combined into a grammar struct for encapsulation and re-use. Just slam static onto the parser and you insta-optimized your code:

      std::string non_comments_php_code(std::string const& contents) {
          std::string non_comments_code;
          static const StripCommentParser scp;
          parse(begin(contents), end(contents), scp, non_comments_code);
          return non_comments_code;
      }

    Observations

    • What's good

      I like how your code pays attention to when the rules should match at qi::eoi. This is oft neglected, and it shows you understand PEG grammar productions well.

    • What's bad

      As I commented before there are a lot of other things weird about this code:

      • you're also skipping qi::space?! That seems very unhelpful if the output should be useful for anything (other than counting code size ignoring significant whitespace?)

      • your inside-out parse driver had extra logic to randomly also skip '\n'. That's odd, because

        (a) you had an entire parser to skip things already
        (b) that parser rule (`skipper`) includes `qi::space`, which includes `'\n'`
        
      • You're randomly using standard_wide. This is a bad idea because your input AND output are not wide-character. Also, I expect PHP is UTF8 by definition/convention.

      • Your patterns arbitrarily assume particular whitespace. E.g. "namespace\t" will not be matched, because the rules only accept a single space after the keyword.

      If you care to explain what the real goal of the code is, I can tell you what I'd write.

    Questions

    Q. Am I using the latest technology to parse a grammar in boost? A few years ago I used only Spirit but I didn't use qi.

    That's interesting. "Years ago" is when I'd use Qi. Nowadays I still recommend Qi, but you have the option of using C++14 Spirit X3 (going C++17 now).

    If all you're doing is squeezing ignorable input then I'd say X3 is a better choice. However there are areas where I think X3 isn't as mature (e.g. attribute propagation/handling).

    Q. Is this the right approach with spirit?

    Yes and no. Yes in the sense that you create rules. No in the sense that you used the skipper as the grammar (and as the skipper too), and then wrote your own parser around the skipper. I think the above example is what you want.

    Q. What is the reason for putting the grammar inside a block of code?

    Scope. That's what blocks do. In this case it limits the scope of all the detail rules, as well as the using namespace directive. The struct serves the same goal, but packages it up in a reusable instance.
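
    The scoping effect can be shown without Spirit at all. In this standard-library-only sketch, the namespace detail_rules and both function names are hypothetical stand-ins for qi and the detail rules:

```cpp
#include <string>

namespace detail_rules {  // hypothetical stand-in for `qi`
    // Returns `s` without `prefix` if it starts with it, else `s` unchanged.
    inline std::string strip_prefix(std::string const& s, std::string const& prefix) {
        return s.compare(0, prefix.size(), prefix) == 0 ? s.substr(prefix.size()) : s;
    }
}

std::string without_php_tag(std::string const& line) {
    std::string result;
    {
        using namespace detail_rules;     // in effect only inside this block
        std::string const tag = "<?php";  // local detail, gone at the closing brace
        result = strip_prefix(line, tag);
    }
    // Out here `tag` no longer exists, and strip_prefix would need its
    // detail_rules:: qualifier again.
    return result;
}
```

    Only result, declared before the block, outlives it. That mirrors how the rules in the question are declared outside the block while the using namespace qi directive stays confined to it.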