
boost spirit grammar for parsing header columns


I want to parse the header columns of a text file. Column names may be quoted and may use any letter case. Currently I am using the following grammar:

#include <string>
#include <iostream>
#include <boost/spirit/include/qi.hpp>

namespace qi = boost::spirit::qi;

template <typename Iterator, typename Skipper>
struct Grammar : qi::grammar<Iterator, void(), Skipper>
{
        static constexpr char colsep = '|';
        Grammar() : Grammar::base_type(header)
        {
                using namespace qi;
                using ascii::char_;
#define COL(name) (no_case[name] | ('"' >> no_case[name] >> '"'))
                header = (COL("columna") | COL("column_a")) >> colsep >>
                        (COL("columnb") | COL("column_b")) >> colsep >>
                        (COL("columnc") | COL("column_c")) >> eol >> eoi;
#undef COL
        }
        qi::rule<Iterator, void(), Skipper> header;
};

int main()
{
        const std::string s{"columnA|column_B|column_c\n"};
        auto begin(std::begin(s)), end(std::end(s));
        Grammar<std::string::const_iterator, qi::blank_type> p;
        bool ok = qi::phrase_parse(begin, end, p, qi::blank);

        if (ok && begin == end)
                std::cout << "Header ok" << std::endl;
        else if (ok && begin != end)
                std::cout << "Remaining unparsed: '" << std::string(begin, end) << "'" << std::endl;
        else
                std::cout << "Parse failed" << std::endl;
        return 0;
}

Is this possible without using a macro? Furthermore, I would like to ignore underscores entirely. Can this be achieved with a custom skipper? Ideally, one could write:

header = col("columna") >> colsep >> col("columnb") >> colsep >> col("columnc") >> eol >> eoi;

where col would be an appropriate grammar or rule.


Solution

  • @sehe how can I fix this grammar to support "\"Column_A\"" as well? 6 hours ago

    By this time you have probably realized that there are two different things going on here.

    Separate Yo Concerns

    On the one hand you have a grammar (that allows |-separated columns like columna or "Column_A").

    On the other hand you have semantic analysis (the phase where you check that the parsed contents match certain criteria).

    What is making your life hard is trying to conflate the two. Now, don't get me wrong, there could be (very rare) circumstances where fusing those responsibilities together is absolutely required, but I feel that would always be an optimization. If you need that, Spirit is not your thing, and you're much more likely to be better served by a handwritten parser.

    Parsing

    So let's get brain-dead simple about the grammar:

    static auto headers = (quoted|bare) % '|' > (eol|eoi);
    

    The bare and quoted rules can be pretty much the same as before:

    static auto quoted  = lexeme['"' >> *('\\' >> char_ | "\"\"" >> attr('"') | ~char_('"')) >> '"'];
    static auto bare    = *(graph - '|');
    

    As you can see, this implicitly takes care of quoting and escaping as well as whitespace skipping outside lexemes. When applied simply, it results in a clean list of column names:

    std::string const s = "\"columnA\"|column_B| column_c \n";
    
    std::vector<std::string> headers;
    bool ok = phrase_parse(begin(s), end(s), Grammar::headers, x3::blank, headers);
    
    std::cout << "Parse " << (ok?"ok":"invalid") << std::endl;
    if (ok) for(auto& col : headers) {
        std::cout << std::quoted(col) << "\n";
    }
    

    Prints (Live On Coliru):

    Parse ok
    "columnA"
    "column_B"
    "column_c"
    

    INTERMEZZO: Coding Style

    Let's structure our code so that the separation of concerns is reflected. Our parsing code might use X3, but our validation code doesn't need to be in the same translation unit (cpp file).

    Have a header defining some basic types:

    #include <string>
    #include <vector>
    
    using Header = std::string;
    using Headers = std::vector<Header>;
    

    Define the operations we want to perform on them:

    Headers parse_headers(std::string const& input);
    bool header_match(Header const& actual, Header const& expected);
    bool headers_match(Headers const& actual, Headers const& expected);
    

    Now, main can be rewritten as just:

    auto headers = parse_headers("\"columnA\"|column_B| column_c \n");
    
    for(auto& col : headers) {
        std::cout << std::quoted(col) << "\n";
    }
    
    bool valid = headers_match(headers, {"columna","columnb","columnc"});
    std::cout << "Validation " << (valid?"passed":"failed") << "\n";
    

    And e.g. a parse_headers.cpp could contain:

    #include <boost/spirit/home/x3.hpp>
    
    namespace x3 = boost::spirit::x3;
    
    namespace Grammar {
        using namespace x3;
        static auto quoted  = lexeme['"' >> *('\\' >> char_ | "\"\"" >> attr('"') | ~char_('"')) >> '"'];
        static auto bare    = *(graph - '|');
        static auto headers = (quoted|bare) % '|' > (eol|eoi);
    }
    
    Headers parse_headers(std::string const& input) {
        Headers output;
        if (phrase_parse(begin(input), end(input), Grammar::headers, x3::blank, output))
            return output;
        return {}; // or throw, if you prefer
    }
    

    Validating

    This is what is known as "semantic checks". You take the vector of strings and check them according to your logic:

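    #include <cctype>                           // std::isgraph
    #include <boost/range/adaptors.hpp>         // filtered
    #include <boost/range/algorithm/equal.hpp>  // boost::equal
    #include <boost/algorithm/string.hpp>       // iequals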
    
    bool header_match(Header const& actual, Header const& expected) {
        using namespace boost::adaptors;
        auto significant = [](unsigned char ch) {
            return ch != '_' && std::isgraph(ch);
        };
    
        return boost::algorithm::iequals(actual | filtered(significant), expected);
    }
    
    bool headers_match(Headers const& actual, Headers const& expected) {
        return boost::equal(actual, expected, header_match);
    }
    

    That's all. All the power of algorithms and modern C++ at your disposal, no need to fight with constraints due to parsing context.
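If you don't want Boost.Range for the comparison, the same check can be sketched with the standard library alone. This is an equivalent under my reading of the code above (only `actual` is filtered, `expected` is compared as given, case-insensitively), not a drop-in from the demo.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

using Header  = std::string;
using Headers = std::vector<Header>;

bool header_match(Header const& actual, Header const& expected) {
    auto lower = [](unsigned char ch) { return static_cast<char>(std::tolower(ch)); };

    // Keep only the significant characters of the parsed name:
    // no underscores, no whitespace or other non-graphical characters.
    std::string filtered;
    for (unsigned char ch : actual)
        if (ch != '_' && std::isgraph(ch))
            filtered.push_back(lower(ch));

    // Case-insensitive comparison; the 4-iterator std::equal (C++14)
    // also returns false when the lengths differ.
    return std::equal(filtered.begin(), filtered.end(),
                      expected.begin(), expected.end(),
                      [&](char a, unsigned char b) { return a == lower(b); });
}

bool headers_match(Headers const& actual, Headers const& expected) {
    return std::equal(actual.begin(), actual.end(),
                      expected.begin(), expected.end(), header_match);
}
```

With this, `headers_match({"columnA", "column_B", "column_c"}, {"columna", "columnb", "columnc"})` is true, and a missing or misnamed column makes it false. Note that the expected names must be written without underscores, since only the parsed side is filtered.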

    Full Demo

    The above, Live On Wandbox

    Both parts got significantly simpler:

    • your parser doesn't have to deal with quirky comparison logic
    • your comparison logic doesn't have to deal with grammar concerns (quotes, escapes, delimiters and whitespace)