Search code examples
parsingc++11boostcommentsboost-regex

Boost regex is not matching the same as several regex websites


I am attempting to parse a string using regex, so that when I iterate over its matches, it will give me only the results. My goal is to find all

#include <stuff.h>
#include "stuff.h"

while ignoring them if they are part of a comment block such as

/*
     #include "stuff.h"
*/

Here is my function to read a file, convert it to string, and parse the string, creating tokens which are then iterated over to print them all. the tokes would contain stuff.h , stuff.h based on the previous lines.

The problem that I ran into was using this regex https://regex101.com/r/tQFDr4/2

The question is, is my regex wrong or is it something in the function?

void header_check::filename(const boost::filesystem::directory_iterator& itr)  //function takes directory path                     
{                                                                                                   
    std::string delimeter ("#include.+(?:<|\\\")(.+)(?:>|\\\")(?![^*\\/]* (?:\\*+(?!\\/)[^*\\/]*|\\/+(?!\\*)[^*\\/]*)*\\*\\/)");//regex storage                                                                      
    boost::regex regx(delimeter,boost::regex::perl);//set up regex                                                  
    boost::smatch match;                                                                              
    std::ifstream file (itr->path().string().c_str());//stream to transfer to stream
    std::string content((std::istreambuf_iterator<char>(file)),    
    std::istreambuf_iterator<char>());//string to be parsed
    boost::sregex_token_iterator iter (content.begin(),content.end(), regx, 0);    //creates a match for each search
    boost::sregex_token_iterator end;                                                                 
    for (int attempt =1; iter != end; ++iter) {                                                       
        std::cout<< *iter<<" include #"<<attempt++<<"\n";  //prints results                                             
    }                                                       
}  

Solution

  • First off, you have an extra space character in the regex.

    But the real problem is you're treating the whole input as a single line. If you set that flag: enter image description here

    you will find that regex101 shows the same results.

    In regex, all open quantifiers are greedy by default. As such, you must be a lot more specific. At the very start you have

    #include.+
    

    This is already the end of it, since .+ simply matches all of the content (up to and including the last line). Your only reprieve is that backtracking will occur so that at least 1 "tail" of the regex matches, but all the rest is "souped up" in between. Because .+ literally asks for 1 or as many as possible of any character!


    Attempted Fixes...

    1. make .+ be \s+ or so. In fact, it needs to be \s* because #include<iostream> is perfectly valid C++
    2. next, you cannot match like you did because you'd happily match #include <iostream" or #include "iostream>. And again, .* needs to be limited. In this case, you can make the closing delimiter completely deterministic (because the opening delimiter completely predicts it), so you can use non-greedy Kleene-star:

      #include\s*("(.*?)"|<(.*?)>)
      

    HOWEVER

    The real problem is that you're trying to parse a full on grammar with ... regexen¹.

    All I can say is

    Could you not?!

    Here's a suggestion using Boost Spirit:

    auto comment_ = space 
                  | "//" >> *(char_ - eol) 
                  | "/*" >> *(char_ - "*/")
                  ;
    

    Woah. That's a breath of fresh air. It's almost like programming, instead of wizardry and crossing your thumbs!

    Now for the real meat:

    auto include_ = "#include" >> (
            '<' >> *~char_('>') >> '>'
          | '"' >> *~char_('"') >> '"'
          );
    

    And of course you want have the proof of the pudding too:

    std::string header;
    bool ok = phrase_parse(content.begin(), content.end(), seek[include_], comment_, header);
    
    std::cout << "matched: " << std::boolalpha << ok << ": " << header << "\n";
    

    This parses a single header and prints: Live On Coliru

    matched: true: iostream
    

    Would it be a piece of cake to scale up to all the non-commented includes?

    std::vector<std::string> headers;
    bool ok = phrase_parse(content.begin(), content.end(), *seek[include_], comment_, headers);
    

    Oops. Two bugs. Firstly, we should not be matching our grammar. The best way would be to ensure we are at start of line, but that complicates the grammar. For now, let's disallow names spanning multiple lines:

    auto name_ = rule<struct _, std::string> {} = lexeme[
          '<' >> *(char_ - '>' - eol) >> '>'
        | '"' >> *(char_ - '"' - eol) >> '"'
    ];
    
    auto include_ = "#include" >> name_;
    

    That helps a bit. The other bug is actually tougher, and I think it's a library bug. The problem is that it sees all the includes as active? It turns out that seek does not correctly use the skipper after the first match.² For now, let's work around it:

    bool ok = phrase_parse(content.begin(), content.end(), *(omit[*(char_ - include_)] >> include_) , comment_, headers);
    

    It does take away from the elegance a bit, but it does work:

    The Full Monty

    The full demo Live On Coliru

    // #include <boost/graph/adjacency_list.hpp>
    
    #include "iostream"
    
    #include<fstream> /*
    #include <boost/filesystem.hpp>
    #include <boost/regex.hpp> */ //
    #include <boost/spirit/home/x3.hpp>
    
    
    void filename(std::string const& fname)  //function takes directory path                     
    {                                                                                                   
        using namespace boost::spirit::x3;
    
        auto comment_ = space 
              | "//" >> *(char_ - eol) 
              | "/*" >> *(char_ - "*/")
              ;
    
        auto name_ = rule<struct _, std::string> {} = lexeme[
              '<' >> *(char_ - '>' - eol) >> '>'
            | '"' >> *(char_ - '"' - eol) >> '"'
        ];
    
        auto include_ = "#include" >> name_;
    
        auto const content = [&]() -> std::string {
            std::ifstream file(fname);
            return { std::istreambuf_iterator<char>{file}, {} };//string to be parsed
        }();
    
        std::vector<std::string> headers;
        /*bool ok = */phrase_parse(content.begin(), content.end(), *(omit[*(char_ - include_)] >> include_) , comment_, headers);
    
        std::cout << "matched: " << headers.size() << " active includes:\n";
        for (auto& header : headers)
            std::cout << " - " << header << "\n";
    }
    
    int main() {
        filename("main.cpp");
    }
    

    Printing

    matched: 3 active includes:
     - iostream
     - fstream
     - boost/spirit/home/x3.hpp
    

    ¹ And it's not in Perl6, in which case you could be forgiven.

    ² I'll try to fix/report this tomorrow