c++ parsing boost-spirit-qi

rule to extract key+phrases from a text document


I want to extract the key phrases from a document of the form "something KEY phrase END something ...", etc. My rule works well, but the result does not contain the key name. What should the rule be in order to get the string "KEY phrase"? Thank you for the advice.

std::vector<std::string> doc; 
bool r = qi::phrase_parse(first,last, 
  ( qi::omit[*(qi::char_-"KEY")] 
    >> qi::lexeme[ "KEY"
    >> *(qi::char_-"KEY" -"END")] ) % "END"
, qi::space, doc);

Solution

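  • qi::lit(...) doesn't synthesize an attribute.

    qi::string(...) does.

    A bare string literal such as "KEY" in a parser expression is treated like qi::lit("KEY"), so the matched keyword never reaches the result. Replacing "KEY" with qi::string("KEY") should likely fix that (it is hard to be sure without knowing the type of doc):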

    bool r = qi::phrase_parse(first,last, 
      ( qi::omit[*(qi::char_-"KEY")] 
        >> qi::lexeme[ qi::string("KEY")
        >> *(qi::char_-"KEY" -"END")] ) % "END"
    , qi::space, doc);
    

    BONUS: See also the seek[] parser directive from the Spirit Repository:

    The seek[] parser-directive skips all input until the subject parser matches.

    Here's what I'd do:

    Live On Coliru

    #include <boost/spirit/include/qi.hpp>
    #include <boost/spirit/repository/include/qi_seek.hpp>
    #include <iostream>
    #include <string>
    #include <vector>
    namespace qi = boost::spirit::qi;
    namespace qr = boost::spirit::repository::qi;
    
    extern std::string const sample; // below
    
    int main() {
        auto f(sample.begin()), l(sample.end());
    
        std::vector<std::string> phrases;
    
        if (qi::parse(f,l, *qi::as_string[
                    qr::seek[qi::string("KEY")] >> *(qi::char_ - "END")
                ], phrases)) 
        {
            for (size_t i = 0; i < phrases.size(); ++i) 
                std::cout << "keyphrase #" << i << ": '" << phrases[i] << "'\n";
        }
    }
    

    Prints:

    keyphrase #0: 'KEY@v/0qwJTjgFQwNmose7LiEmAmKpIdK3TPmkCs@'
    keyphrase #1: 'KEY@G1TErN1QSSKi17BSnwBKML@'
    keyphrase #2: 'KEY@pWhBKmc0sD+o@'
    keyphrase #3: 'KEY@pwgjNJ0FvWGRezwi74QdIQdmUuKVyquWuvXz4tBOXqMMqco@'
    keyphrase #4: 'KEY@aJ3QUfLh3AqfKyxcUSiDbanZmCNGza6jb6pZ@'
    keyphrase #5: 'KEY@bYJzitZUyXlgPA009qBpleHIJ9uJUSdJO78iisUgHkoqUpf+oXZQF9X/7v2fikgemCD@'
    

    Sample data included in a comment in this answer: /here/
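
To make it clearer what the seek[]-based grammar above computes, here is a minimal sketch of the same extraction written with plain std::string::find and no Spirit at all. The function name extract_key_phrases and its key/end parameters are hypothetical, introduced only for illustration. It assumes the same behaviour as the skipper-less qi::parse call: everything before a "KEY" marker is skipped, and the phrase runs up to (but not including) the next "END", or to the end of input if no "END" follows.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of Boost.Spirit): collects every span that
// starts at a `key` marker and stops just before the next `end` marker.
std::vector<std::string> extract_key_phrases(const std::string& text,
                                             const std::string& key = "KEY",
                                             const std::string& end = "END")
{
    std::vector<std::string> phrases;
    std::string::size_type pos = 0;

    // Roughly seek[string("KEY")]: jump to the next marker, ignoring
    // whatever precedes it.
    while ((pos = text.find(key, pos)) != std::string::npos) {
        // Roughly *(char_ - "END"): consume up to the next end marker.
        const std::string::size_type stop = text.find(end, pos + key.size());

        if (stop == std::string::npos) {
            // No end marker left: the phrase runs to the end of the input.
            phrases.push_back(text.substr(pos));
            break;
        }

        // The key itself is kept, mirroring qi::string("KEY").
        phrases.push_back(text.substr(pos, stop - pos));
        pos = stop; // resume searching from the end marker
    }
    return phrases;
}
```

Calling extract_key_phrases("something KEY phrase END something") should yield a single element, "KEY phrase " (including the trailing space, since nothing is skipped here). The Spirit version is still the better fit once the phrase itself needs real parsing; this is only meant to show the control flow.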