Using a grammar to parse a string without lookahead?


Got this text:

Want this || Not this

The line may also look like this:

Want this | Not this

with a single pipe.

I'm using this grammar to parse it:

    grammar HC {
       token TOP {  <pre> <divider> <post> }
       token pre { \N*? <?before <divider>> }
       token divider { <[|]> ** 1..2 } 
       token post { \N* }
    } 

Is there a better way to do this? I'd love to be able to do something more like this:

    grammar HC {
       token TOP {  <pre> <divider> <post> }
       token pre { \N*? }
       token divider { <[|]> ** 1..2 }
       token post { \N* }
    } 

But this does not work. And if I do this:

    grammar HC {
       token TOP {  <pre>* <divider> <post> }
       token pre { \N }
       token divider { <[|]> ** 1..2 }
       token post { \N* }
    } 

Each character before divider gets its own <pre> capture. Thanks.


Solution

  • As always, TIMTOWTDI (there is more than one way to do it).

    "I'd love to be able to do something more like this"

    You can. Just switch the first two rule declarations from token to regex:

        grammar HC {
          regex TOP {  <pre> <divider> <post> }
          regex pre { \N*? }
          token divider { <[|]> ** 1..2 }
          token post { \N* }
        }
    

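    This works because regex disables :ratchet, whereas token and rule enable it. Without :ratchet the engine is allowed to backtrack, so when <divider> fails to match it can go back into <pre> and let the frugal \N*? consume one more character.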

    (Explaining why you need to switch it off for both rules is beyond my paygrade, certainly for tonight, and quite possibly till someone else explains why to me so I can pretend I knew all along.)
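
    The effect of :ratchet is easy to see outside a grammar. The sketch below is an illustration, not part of the original answer: it first shows a ratcheted pattern refusing to give characters back, then runs the regex-based grammar on both sample lines from the question. The loop and the output format are my own additions.

    ```raku
    # With :ratchet (:r), the greedy \w+ keeps all of 'abc', leaving nothing for '.'.
    say so 'abc' ~~ / :r \w+ . /;   # False
    # Without it, \w+ gives one character back and the match succeeds.
    say so 'abc' ~~ / \w+ . /;      # True

    grammar HC {
        regex TOP     { <pre> <divider> <post> }
        regex pre     { \N*? }             # frugal, and free to grow on backtracking
        token divider { <[|]> ** 1..2 }
        token post    { \N* }
    }

    for 'Want this || Not this', 'Want this | Not this' -> $line {
        my $m = HC.parse($line);
        say $m
            ?? "pre: '{$m<pre>}', divider: '{$m<divider>}', post: '{$m<post>}'"
            !! "no match for: $line";
    }
    ```

    Both lines should parse, with <pre> holding 'Want this ' (including the trailing space), since nothing in the grammar trims whitespace.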

    "if I do this ... each character gets its own <pre> capture"

    By default, "calling a named regex installs a named capture with the same name" [... couple sentences later:] "If no capture is desired, a leading dot or ampersand will suppress it". So change <pre> to <.pre>.

    Next, you can manually add a named capture by wrapping a pattern in $<name>=[pattern]. So to capture the whole string matched by consecutive calls of the pre rule, wrap the non-capturing pattern <.pre>*? in $<pre>=[...]:

        grammar HC {
           token TOP { $<pre>=[<.pre>*?] <divider> <post> }
           token pre { \N }
           token divider { <[|]> ** 1..2 }
           token post { \N* }
        }
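
    To check that this version produces one <pre> capture for the whole prefix rather than one per character, a quick test along these lines may help. It is only a sketch: the loop and output format are illustrative additions, and the grammar is repeated so the snippet runs on its own.

    ```raku
    grammar HC {
        # <.pre> calls the pre token without capturing; $<pre>=[...] captures the whole run.
        token TOP     { $<pre>=[<.pre>*?] <divider> <post> }
        token pre     { \N }
        token divider { <[|]> ** 1..2 }
        token post    { \N* }
    }

    for 'Want this || Not this', 'Want this | Not this' -> $line {
        my $m = HC.parse($line) or next;
        # $m<pre> stringifies to the full text before the divider.
        say "pre: '{$m<pre>}', divider: '{$m<divider>}', post: '{$m<post>}'";
    }
    ```

    Here the frugal quantifier sits in the same token as <divider>, so no subrule has to be re-entered, and all four declarations can stay as token.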