I am new to writing parsers. I am attempting to create a parser which can extract US zip codes from input text. I have created the following parser patterns, which do most of what I want. I am able to match 5 digit zip codes, or 9 digit zip codes (90210-1234) as expected.
However, the pattern still matches five-digit substrings of longer digit runs, such as:
246764 (returns 46764)
578397 (returns 78397)
I wanted to specify some anchoring conditions for the right and left of the above pattern, in the hopes that I could eliminate the examples above. More specifically, I want to prohibit matching when digits or dashes are adjacent to the beginning or end of the candidate zip code.
Test data (bold entries should be matched)
Full code:
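#include <boost/spirit/home/x3.hpp>
#include <boost/fusion/include/vector.hpp>
#include <boost/fusion/include/at_c.hpp>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>

using It = std::string::const_iterator;
using ZipCode = boost::fusion::vector<It, It>;

// Lets raw[] store its matched iterator pair directly in a ZipCode.
namespace boost { namespace spirit { namespace x3 { namespace traits {
    template <>
    void move_to<It, ZipCode>(It b, It e, ZipCode& z)
    {
        z = ZipCode(b, e);
    }
}}}}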
void Parse(std::string const& input)
{
    auto start = std::begin(input);
    auto begin = start;
    auto end = std::end(input);

    ZipCode current;
    std::vector<ZipCode> matches;

    auto const fiveDigits = boost::spirit::x3::repeat(5)[boost::spirit::x3::digit];
    auto const fourDigits = boost::spirit::x3::repeat(4)[boost::spirit::x3::digit];
    auto const dash = boost::spirit::x3::char_('-');
    auto const notDashOrDigit = boost::spirit::x3::char_ - (dash | boost::spirit::x3::digit);

    auto const zipCode59 =
        -(&notDashOrDigit) >>
        boost::spirit::x3::raw[fiveDigits >> -(dash >> fourDigits)] >>
        &notDashOrDigit;

    while (begin != end)
    {
        if (!boost::spirit::x3::phrase_parse(begin, end, zipCode59, boost::spirit::x3::blank, current))
        {
            // No match at this position. Advancing by one character is assumed;
            // the body of this branch is not shown in the post.
            ++begin;
        }
        else
        {
            auto startOffset = std::distance(start, boost::fusion::at_c<0>(current));
            auto endOffset = std::distance(start, boost::fusion::at_c<1>(current));
            auto length = std::distance(boost::fusion::at_c<0>(current), boost::fusion::at_c<1>(current));
            std::cout << "Matched (\"" << startOffset
                      << "\", \""
                      << endOffset
                      << "\") => \""
                      << input.substr(startOffset, length)
                      << "\""
                      << std::endl;
        }
    }
}
This code with the above test data produces the following output:
Matched ("0", "5") => "12345"
Matched ("29", "34") => "46764"
Matched ("42", "47") => "78397"
Matched ("68", "78") => "15222-1825"
If I change zipCode59 to the following, I get no hits back:
auto const zipCode59 =
    &notDashOrDigit >>
    boost::spirit::x3::raw[fiveDigits >> -(dash >> fourDigits)] >>
    &notDashOrDigit;
I have read through this question: Stop X3 symbols from matching substrings. However, that question makes use of a symbol table. I don't think this can work for me, because I cannot list the strings to match in advance. I'm also unclear as to how the answer to that question manages to prohibit the leading content.
Using -(parser) just makes (parser) optional. Using it with -(&parser) has literally no effect: the lookahead consumes no input, and making it optional means it can never fail.
Perhaps you wanted a negative assertion ("lookahead"), which is !(parser) (the opposite of &(parser)).
Note that the confusion may stem from the difference between unary minus (optional) and binary minus (difference, e.g. reducing character sets).
Asserting that a zip code starts with something that is not a dash/digit seems ... confused. If you want to positively assert that something other than a dash or digit comes next, that would be &~char_("-0-9") (using unary ~ to negate the character set), but it would prevent matching at the very start of input.
Shedding some of the complexity left and right I'd naively start out with something like:
using It = std::string::const_iterator;
using ZipCode = boost::iterator_range<It>;

auto Parse(std::string const& input) {
    using namespace boost::spirit::x3;
    auto dig = [](int n) { return repeat(n)[digit]; };
    auto const zip59 = dig(5) >> -('-' >> dig(4));
    auto const valid = zip59 >> !graph;

    std::vector<ZipCode> matches;
    if (!parse(begin(input), end(input), *seek[raw[valid]], matches))
        throw std::runtime_error("parser failure");

    return matches;
}
Which of course matches too much:
Matched '12345'
Matched '78397'
Matched '15222-1825'
Matched '53410-2807'
To limit it (and still match at start-of-input) you could seek[&('-'|digit)] and then require a valid zip.
I freely admit to having had to fiddle with things a bit before getting it "right". In the process I created a debug helper:
auto trace_as = [&input](std::string const& caption, auto parser) {
    return raw[parser] [([=,&input](auto& ctx) {
        std::cout << std::setw(12) << (caption+":") << " '";
        auto range = _attr(ctx);
        for (auto ch : range) switch (ch) {
            case '\0': std::cout << "\\0"; break;
            case '\r': std::cout << "\\r"; break;
            case '\n': std::cout << "\\n"; break;
            default: std::cout << ch;
        }
        std::cout << "' at " << std::distance(input.begin(), range.begin()) << "\n";
    })];
};

auto const valid = seek[&trace_as("seek", '-' | digit)] >> raw[zip59] >> !graph;

std::vector<ZipCode> matches;
if (!parse(begin(input), end(input), -valid % trace_as("skip", *graph >> +space), matches))
    throw std::runtime_error("parser failure");
Which produces the following additional diagnostic output:
seek: '1' at 0
skip: '\n ' at 5
seek: '4' at 13
skip: 'foo456\n ' at 10
seek: '5' at 23
skip: 'ba58r\n ' at 21
seek: '2' at 31
skip: '246764anc\n ' at 31
seek: '5' at 45
skip: '578397\n ' at 45
seek: '9' at 56
skip: '90210-\n ' at 56
seek: '1' at 67
skip: '15206-1\n ' at 67
seek: '1' at 79
skip: '\n ' at 89
seek: '1' at 94
Matched '12345'
Matched '15222-1825'
Now that the output is what we desire, let's cut the scaffolding again:
#include <boost/spirit/home/x3.hpp>
#include <stdexcept>
#include <string>
#include <vector>

using It = std::string::const_iterator;
using ZipCode = boost::iterator_range<It>;

auto Parse(std::string const& input) {
    using namespace boost::spirit::x3;
    auto dig = [](int n) { return repeat(n)[digit]; };
    auto const zip59 = dig(5) >> -('-' >> dig(4));
    // Seek to the first dash or digit, then require a whole zip that is not
    // followed by another printable character.
    auto const valid = seek[&('-' | digit)] >> raw[zip59] >> !graph;

    std::vector<ZipCode> matches;
    if (!parse(begin(input), end(input), -valid % (*graph >> +space), matches))
        throw std::runtime_error("parser failure");

    return matches;
}
#include <iostream>
int main() {
std::string const sample = R"(12345
for (auto zip : Parse(sample))
std::cout << "Matched '" << zip << "'\n";
Matched '12345'
Matched '15222-1825'
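If you want something to cross-check the parser against, a hand-written scanner without Spirit is one option. The sketch below is not the X3 approach above: it implements the rule as worded in the question (no digit or dash may touch the candidate), so unlike the !graph check it would still accept a zip code with letters directly attached. The names isDashOrDigit, allDigits and findZipCodes are invented for this example.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: the characters that must not touch a zip code.
static bool isDashOrDigit(char c) {
    return c == '-' || std::isdigit(static_cast<unsigned char>(c)) != 0;
}

// Hypothetical helper: true if s[pos, pos + len) consists of digits only.
static bool allDigits(std::string const& s, std::size_t pos, std::size_t len) {
    for (std::size_t k = pos; k < pos + len; ++k)
        if (!std::isdigit(static_cast<unsigned char>(s[k])))
            return false;
    return true;
}

// Splits the input into maximal runs of digits and dashes and keeps a run
// only if the whole run is ddddd or ddddd-dddd.
std::vector<std::string> findZipCodes(std::string const& s) {
    std::vector<std::string> found;
    std::size_t i = 0;
    while (i < s.size()) {
        if (!isDashOrDigit(s[i])) {
            ++i;
            continue;
        }
        std::size_t j = i;
        while (j < s.size() && isDashOrDigit(s[j]))
            ++j;
        std::string const run = s.substr(i, j - i);
        bool const five = run.size() == 5 && allDigits(run, 0, 5);
        bool const nine = run.size() == 10 && allDigits(run, 0, 5) &&
                          run[5] == '-' && allDigits(run, 6, 4);
        if (five || nine)
            found.push_back(run);
        i = j;  // continue after this run
    }
    return found;
}
```

Because a candidate must be an entire run of digits and dashes, inputs like 246764anc, 90210- and 15206-1 are rejected without any explicit lookahead.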