Discard and skip over unstructured text with Perl Marpa?


I'm using Marpa::R2::Scanless::G to parse a legacy text file format. The format has a well-structured section at the top, followed by a poorly structured mess of text and uuencoded data. The trailing part can be ignored entirely, but I can't figure out how to tell the Marpa SLIF interface: you're done; don't bother with the remaining text.

In very simplified terms a file might look like this:

    ("field_a_val"  1,
     "field_b_vals" (1,2,3),
     "field_c_pairs" ((a 1)(b 2)(c 3))
    )now_stuff_i_dont_care_about a;oiwermnv;alwfja;sldfa
    asdf343avadfg;okm;om;oia3
    e{<|1ydblV, HYED c"L. 78b."8
    U=nK Wpw: Qh(e x!,~dU...

I can parse all the data I need out of the top section, but when the parser reaches the junk at the bottom and I don't try to match it, I get: "Error in SLIF parse: Parse exhausted, but lexemes remain."

I cannot figure out how to write a rule that slurps up potentially megabytes of junk and just keeps going to the end of the file, regardless of the text it encounters. I had no luck with my attempts to use :discard or 'pause => after', though I'm likely misusing them.

For context, I don't have a solid understanding of parsing and lexing. I banged on the grammar until it worked.


Solution

  • The simplest thing to do is to introduce a lexeme that matches everything after the part you're interested in:

    lexeme default = latm => 1  # only try lexemes the grammar can accept here, so THE_REST cannot swallow the whole document
    
    Header
      ::= ActualHeader (AllTheRest) action => ::first
    ActualHeader
      ::= ... # your code here
    ...
    
    AllTheRest
      ::=           action => ::undef  # rest is optional
    AllTheRest
      ::= THE_REST  action => ::undef  # matches anything
    THE_REST ~ [\s\S]+
    

    We cannot use a :discard rule for THE_REST, because discarded text may occur anywhere in the input, and we only want to allow the rest at the end. The character class [\s\S] matches any character, including newlines.
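
For illustration, here is a minimal, self-contained sketch of the same idea. It assumes a drastically reduced header consisting of a single quoted name and a number; the symbol names File, Header, String, Number and whitespace, and the sample input, are made up for this example and are not taken from the real file format.

```perl
use strict;
use warnings;
use Marpa::R2;

# Hypothetical toy grammar: one ("name" number) header, then arbitrary junk.
my $dsl = <<'END_OF_DSL';
lexeme default = latm => 1   # only try lexemes the grammar accepts at this point

File       ::= Header (AllTheRest)            action => ::first
Header     ::= ('(') String Number (')')      action => ::array

AllTheRest ::=                                action => ::undef  # rest is optional
AllTheRest ::= THE_REST                       action => ::undef  # matches anything

String       ~ '"' string_chars '"'
string_chars ~ [^"]+
Number       ~ [\d]+
THE_REST     ~ [\s\S]+

:discard     ~ whitespace
whitespace   ~ [\s]+
END_OF_DSL

my $grammar = Marpa::R2::Scanless::G->new( { source => \$dsl } );

# Structured header followed by text the grammar knows nothing about.
my $input = qq{("field_a_val"  1\n)now_stuff_i_dont_care_about a;oiwermnv\nasdf343avadfg;okm\n};

# parse() dies on failure and returns a reference to the parse value.
my $value_ref = $grammar->parse( \$input );
my ( $name, $number ) = @{ ${$value_ref} };

print "name=$name number=$number\n";
```

Because THE_REST is an ordinary lexeme that only appears after Header, LATM keeps it out of the way while the header is being read, and it then consumes everything up to the end of the input in one token. The String lexeme value keeps its surrounding quotes, since lexeme values are the matched text.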