I want to parse the header columns of a text file. Column names may be quoted and may use any letter case. Currently I am using the following grammar:
#include <string>
#include <iostream>
#include <boost/spirit/include/qi.hpp>
namespace qi = boost::spirit::qi;
template <typename Iterator, typename Skipper>
struct Grammar : qi::grammar<Iterator, void(), Skipper>
{
static constexpr char colsep = '|';
Grammar() : Grammar::base_type(header)
{
using namespace qi;
using ascii::char_;
#define COL(name) (no_case[name] | ('"' >> no_case[name] >> '"'))
header = (COL("columna") | COL("column_a")) >> colsep >>
(COL("columnb") | COL("column_b")) >> colsep >>
(COL("columnc") | COL("column_c")) >> eol >> eoi;
#undef COL
}
qi::rule<Iterator, void(), Skipper> header;
};
int main()
{
const std::string s{"columnA|column_B|column_c\n"};
auto begin(std::begin(s)), end(std::end(s));
Grammar<std::string::const_iterator, qi::blank_type> p;
bool ok = qi::phrase_parse(begin, end, p, qi::blank);
if (ok && begin == end)
std::cout << "Header ok" << std::endl;
else if (ok && begin != end)
std::cout << "Remaining unparsed: '" << std::string(begin, end) << "'" << std::endl;
else
std::cout << "Parse failed" << std::endl;
return 0;
}
Is this possible without the use of a macro? Furthermore, I would like to ignore underscores entirely. Can this be achieved with a custom skipper? In the end it would be ideal if one could write:
header = col("columna") >> colsep >> col("columnb") >> colsep >> col("columnc") >> eol >> eoi;
where col would be an appropriate grammar or rule.
@sehe how can I fix this grammar to support "\"Column_A\"" as well?
By this time you should probably have realized that there are two different things going on here.
On the one hand you have a grammar (that allows |-separated columns like columna or "Column_A").
On the other hand you have semantic analysis (the phase where you check that the parsed contents match certain criteria).
The thing that is making your life hard is trying to conflate the two. Now, don't get me wrong, there could be (very rare) circumstances where fusing those responsibilities together is absolutely required, but I feel that would always be an optimization. If you need that, Spirit is not your thing, and you're much more likely to be better served by a handwritten parser.
So let's get brain-dead simple about the grammar:
static auto headers = (quoted|bare) % '|' > (eol|eoi);
The bare and quoted rules can be pretty much the same as before:
static auto quoted = lexeme['"' >> *('\\' >> char_ | "\"\"" >> attr('"') | ~char_('"')) >> '"'];
static auto bare = *(graph - '|');
As you can see, this implicitly takes care of quoting and escaping as well as whitespace skipping outside lexemes. When applied simply, it will result in a clean list of column names:
std::string const s = "\"columnA\"|column_B| column_c \n";
std::vector<std::string> headers;
bool ok = phrase_parse(begin(s), end(s), Grammar::headers, x3::blank, headers);
std::cout << "Parse " << (ok?"ok":"invalid") << std::endl;
if (ok) for(auto& col : headers) {
std::cout << std::quoted(col) << "\n";
}
This prints (Live On Coliru):
Parse ok
"columnA"
"column_B"
"column_c"
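To make the quoting and escaping rules concrete, here is a rough hand-written equivalent of what the quoted rule accepts: a backslash escapes the next character, a doubled quote yields a single quote, and anything else up to the closing quote is taken literally. This is only an illustrative sketch, not part of the Spirit grammar; read_quoted is a hypothetical helper name, and it assumes C++17 for std::optional and std::string_view.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper (not from the answer): decodes one quoted field that
// starts at in[pos]. On success it returns the decoded text and moves pos
// just past the closing quote; on failure pos is left untouched.
std::optional<std::string> read_quoted(std::string_view in, std::size_t& pos) {
    if (pos >= in.size() || in[pos] != '"')
        return std::nullopt;              // a quoted field must open with '"'

    std::string out;
    std::size_t i = pos + 1;
    while (i < in.size()) {
        char const c = in[i];
        if (c == '\\' && i + 1 < in.size()) {
            out += in[i + 1];             // backslash escape: keep next char
            i += 2;
        } else if (c == '"' && i + 1 < in.size() && in[i + 1] == '"') {
            out += '"';                   // doubled quote: one literal quote
            i += 2;
        } else if (c == '"') {
            pos = i + 1;                  // closing quote: done
            return out;
        } else {
            out += c;                     // ordinary character
            ++i;
        }
    }
    return std::nullopt;                  // ran out of input before closing quote
}
```

In the Spirit version all of this is expressed by the single quoted rule; the sketch just spells out the same three alternatives by hand.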
Let's structure our code so that the separation of concerns is reflected. Our parsing code might use X3, but our validation code doesn't need to be in the same translation unit (cpp file).
Have a header defining some basic types:
#include <string>
#include <vector>
using Header = std::string;
using Headers = std::vector<Header>;
Define the operations we want to perform on them:
Headers parse_headers(std::string const& input);
bool header_match(Header const& actual, Header const& expected);
bool headers_match(Headers const& actual, Headers const& expected);
Now, main can be rewritten as just:
auto headers = parse_headers("\"columnA\"|column_B| column_c \n");
for(auto& col : headers) {
std::cout << std::quoted(col) << "\n";
}
bool valid = headers_match(headers, {"columna","columnb","columnc"});
std::cout << "Validation " << (valid?"passed":"failed") << "\n";
And e.g. a parse_headers.cpp could contain:
#include <boost/spirit/home/x3.hpp>
namespace x3 = boost::spirit::x3;
namespace Grammar {
using namespace x3;
static auto quoted = lexeme['"' >> *('\\' >> char_ | "\"\"" >> attr('"') | ~char_('"')) >> '"'];
static auto bare = *(graph - '|');
static auto headers = (quoted|bare) % '|' > (eol|eoi);
}
Headers parse_headers(std::string const& input) {
Headers output;
if (phrase_parse(begin(input), end(input), Grammar::headers, x3::blank, output))
return output;
return {}; // or throw, if you prefer
}
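Validating the parsed names is what is known as "semantic checks". You take the vector of strings and check them according to your logic:
#include <cctype> // std::isgraph
#include <boost/range/adaptors.hpp>
#include <boost/range/algorithm/equal.hpp> // boost::equal
#include <boost/algorithm/string.hpp>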
bool header_match(Header const& actual, Header const& expected) {
using namespace boost::adaptors;
auto significant = [](unsigned char ch) {
return ch != '_' && std::isgraph(ch);
};
return boost::algorithm::iequals(actual | filtered(significant), expected);
}
bool headers_match(Headers const& actual, Headers const& expected) {
return boost::equal(actual, expected, header_match);
}
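If pulling in Boost just for the comparison is not desirable, the same check can be sketched with the standard library alone. This is a minimal sketch under the same assumptions as the code above: only the actual name is filtered (underscores and non-graphic characters dropped), the expected name is only case-folded, and the 4-iterator std::equal requires C++14.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

using Header  = std::string;
using Headers = std::vector<Header>;

bool header_match(Header const& actual, Header const& expected) {
    auto lower = [](unsigned char ch) {
        return static_cast<char>(std::tolower(ch));
    };

    // Keep only the significant characters of the actual name, lower-cased.
    std::string folded;
    for (unsigned char ch : actual) {
        if (ch != '_' && std::isgraph(ch))
            folded += lower(ch);
    }

    // Case-insensitive comparison against the expected name.
    return std::equal(folded.begin(), folded.end(),
                      expected.begin(), expected.end(),
                      [&](char a, char b) { return a == lower(b); });
}

bool headers_match(Headers const& actual, Headers const& expected) {
    // Same number of columns, and each pair must match in order.
    return std::equal(actual.begin(), actual.end(),
                      expected.begin(), expected.end(),
                      header_match);
}
```

With this, headers_match({"columnA", "column_B", "column_c"}, {"columna", "columnb", "columnc"}) is true, while a missing or differently named column makes it false.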
That's all. All the power of algorithms and modern C++ at your disposal, no need to fight with constraints due to parsing context.
The above, Live On Wandbox
Both parts got significantly simpler: