Tags: c++, parsing, c++14, boost-spirit, boost-spirit-x3

How do I properly specify anchoring conditions in Spirit X3?


I am new to writing parsers. I am attempting to create a parser which can extract US zip codes from input text. I have created the following parser patterns, which do most of what I want. I am able to match 5 digit zip codes, or 9 digit zip codes (90210-1234) as expected.

However, it does not allow me to avoid matching things like:

246764 (returns 46764)
578397 (returns 78397)

I wanted to specify some anchoring conditions for the right and left of the above pattern, in the hopes that I could eliminate the examples above. More specifically, I want to prohibit matching when digits or dashes are adjacent to the beginning or end of the candidate zip code.

Test data (only 12345 and 15222-1825 should be matched)

12345

foo456

ba58r

246764anc

578397

90210-
15206-1
15222-1825
15212-4267-53410-2807

Full code:

using It = std::string::const_iterator;
using ZipCode = boost::fusion::vector<It, It>;

namespace boost { namespace spirit { namespace x3 { namespace traits {
    template <>
    void move_to<It, ZipCode>(It b, It e, ZipCode& z)
    {
        z = { b, e };
    }
}}}}

void Parse(std::string const& input)
{
    auto start = std::begin(input);
    auto begin = start;
    auto end = std::end(input);

    ZipCode current;
    std::vector<ZipCode> matches;

    auto const fiveDigits = boost::spirit::x3::repeat(5)[boost::spirit::x3::digit];
    auto const fourDigits = boost::spirit::x3::repeat(4)[boost::spirit::x3::digit];
    auto const dash = boost::spirit::x3::char_('-');
    auto const notDashOrDigit = boost::spirit::x3::char_ - (dash | boost::spirit::x3::digit);

    auto const zipCode59 = 
        boost::spirit::x3::lexeme
        [
            -(&notDashOrDigit) >> 
            boost::spirit::x3::raw[fiveDigits >> -(dash >> fourDigits)] >> 
            &notDashOrDigit
        ];

    while (begin != end)
    {
        if (!boost::spirit::x3::phrase_parse(begin, end, zipCode59, boost::spirit::x3::blank, current))
        {
            ++begin;
        }
        else
        {
            auto startOffset = std::distance(start, boost::fusion::at_c<0>(current));
            auto endOffset = std::distance(start, boost::fusion::at_c<1>(current));
            auto length = std::distance(boost::fusion::at_c<0>(current), boost::fusion::at_c<1>(current));
            std::cout << "Matched (\"" << startOffset
                << "\", \"" 
                << endOffset
                << "\") => \""
                << input.substr(startOffset, length)
                << "\""
                << std::endl;
        }
    }
}

This code with the above test data produces the following output:

Matched ("0", "5") => "12345"
Matched ("29", "34") => "46764"
Matched ("42", "47") => "78397"
Matched ("68", "78") => "15222-1825"

If I change zipCode59 to the following, I get no hits back:

auto const zipCode59 = 
    boost::spirit::x3::lexeme
    [
        &notDashOrDigit >> 
        boost::spirit::x3::raw[fiveDigits >> -(dash >> fourDigits)] >> 
        &notDashOrDigit
    ];

I have read through the question "Stop X3 symbols from matching substrings". However, that question makes use of a symbol table. I don't think this can work for me, because I lack the ability to specify hard-coded strings. I'm also unclear as to how the answer to that question manages to prohibit the leading content.


Solution

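  • Using -(parser) just makes (parser) optional. Wrapping a lookahead in it, as in -(&parser), has literally no effect: the assertion consumes nothing, and the optional succeeds whether or not the assertion holds.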

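    Perhaps you wanted a negative assertion (negative "lookahead"), which is !(parser) (the opposite of &(parser)).

    Note that the potential confusion may be because of the difference between unary minus (which makes a parser optional) and binary minus (the difference operator, which reduces character sets).

    Asserting that a zip code starts with something other than a dash or digit seems ... confused. A positive assertion of "something other than a dash or digit" would be &~char_("-0-9") (using unary ~ to negate the character set). However, a lookahead consumes no input, so placing it directly before the five digits demands that the same character be both a non-digit and a digit; that is why your second variant returns no hits. A version that consumed the preceding character instead would prevent matching at the very start of input.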

    Positive approach

    Shedding some of the complexity left and right I'd naively start out with something like:

    using It = std::string::const_iterator;
    using ZipCode = boost::iterator_range<It>;
    
    auto Parse(std::string const& input) {
        using namespace boost::spirit::x3;
        auto dig = [](int n) { return repeat(n)[digit]; };
        auto const zip59 = dig(5) >> -('-' >> dig(4));
        auto const valid = zip59 >> !graph;
    
        std::vector<ZipCode> matches;
        if (!parse(begin(input), end(input), *seek[raw[valid]], matches))
            throw std::runtime_error("parser failure");
    
        return matches;
    }
    

    Which of course matches too much:


    Matched '12345'
    Matched '78397'
    Matched '15222-1825'
    Matched '53410-2807'
    

    Doing The Heroics

    To limit it (and still match at start-of-input) you could seek[&('-'|digit)] and then require a valid zip.

    I freely admit to having had to fiddle with things a bit before getting it "right". In the process I created a debug helper:

    auto trace_as = [&input](std::string const& caption, auto parser) { 
        return raw[parser] [([=,&input](auto& ctx) { 
            std::cout << std::setw(12) << (caption+":") << " '";
            auto range = _attr(ctx);
            for (auto ch : range) switch (ch) {
                case '\0': std::cout << "\\0"; break;
                case '\r': std::cout << "\\r"; break;
                case '\n': std::cout << "\\n"; break;
                default: std::cout << ch;
            }
            std::cout << "' at " << std::distance(input.begin(), range.begin()) << "\n";
        })]; 
    };
    
    auto const valid = seek[&trace_as("seek", '-' | digit)] >> raw[zip59] >> !graph;
    
    std::vector<ZipCode> matches;
    if (!parse(begin(input), end(input), -valid % trace_as("skip", *graph >> +space), matches))
        throw std::runtime_error("parser failure");
    

    Which produces the following additional diagnostic output:


           seek: '1' at 0
           skip: '\n    ' at 5
           seek: '4' at 13
           skip: 'foo456\n    ' at 10
           seek: '5' at 23
           skip: 'ba58r\n    ' at 21
           seek: '2' at 31
           skip: '246764anc\n    ' at 31
           seek: '5' at 45
           skip: '578397\n    ' at 45
           seek: '9' at 56
           skip: '90210-\n    ' at 56
           seek: '1' at 67
           skip: '15206-1\n    ' at 67
           seek: '1' at 79
           skip: '\n    ' at 89
           seek: '1' at 94
    Matched '12345'
    Matched '15222-1825'
    

    Now that the output is what we desire, let's cut the scaffolding again:

    Full Listing


    #include <boost/spirit/home/x3.hpp>
    #include <stdexcept>
    #include <string>
    #include <vector>
    
    using It = std::string::const_iterator;
    using ZipCode = boost::iterator_range<It>;
    
    auto Parse(std::string const& input) {
        using namespace boost::spirit::x3;
        auto dig = [](int n) { return repeat(n)[digit]; };
        auto const zip59 = dig(5) >> -('-' >> dig(4));
        auto const valid = seek[&('-' | digit)] >> raw[zip59] >> !graph;
    
        std::vector<ZipCode> matches;
        if (!parse(begin(input), end(input), -valid % (*graph >> +space), matches))
            throw std::runtime_error("parser failure");
    
        return matches;
    }
    
    #include <iostream>
    int main() {
        std::string const sample = R"(12345
    foo456
    ba58r
    246764anc
    578397
    90210-
    15206-1
    15222-1825
    15212-4267-53410-2807)";
    
        for (auto zip : Parse(sample))
            std::cout << "Matched '" << zip << "'\n";
    }
    

    Prints:

    Matched '12345'
    Matched '15222-1825'
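
As a cross-check on that result, the boundary rule stated in the question (no digit or dash directly before or after the candidate) can be written as a hand-rolled scan using only the standard library. This is a sketch, not the grammar above, and `find_zip_codes` is a hypothetical name introduced here for illustration.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: finds 5-digit or 5+4-digit zip codes that have no
// digit or dash immediately before or after them.
std::vector<std::string> find_zip_codes(std::string const& text) {
    auto is_digit = [&](std::size_t i) {
        return i < text.size() &&
               std::isdigit(static_cast<unsigned char>(text[i])) != 0;
    };
    // A "blocker" is a digit or dash; positions past the end never block.
    auto is_blocker = [&](std::size_t i) {
        return i < text.size() && (text[i] == '-' || is_digit(i));
    };
    // True if exactly n digits start at pos.
    auto digits_at = [&](std::size_t pos, std::size_t n) {
        for (std::size_t k = 0; k < n; ++k)
            if (!is_digit(pos + k)) return false;
        return true;
    };

    std::vector<std::string> hits;
    std::size_t i = 0;
    while (i < text.size()) {
        bool const left_ok = (i == 0) || !is_blocker(i - 1);
        if (left_ok && digits_at(i, 5)) {
            std::size_t len = 5;
            if (i + 5 < text.size() && text[i + 5] == '-' && digits_at(i + 6, 4))
                len = 10;  // 5 digits, a dash, and 4 more digits
            if (!is_blocker(i + len)) {  // right-hand anchor
                hits.push_back(text.substr(i, len));
                i += len;
                continue;
            }
        }
        ++i;
    }
    return hits;
}
```

On the sample input this yields the same two entries, `12345` and `15222-1825`. The rule is looser than the grammar above in one respect: the grammar ends with `!graph`, so it rejects `12345b`, whereas this scan accepts it because a letter is neither a digit nor a dash.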