Alternate version of grammar not working as I'd prefer


This code parses $string as I'd like:

#! /usr/bin/env raku

# The first line of the heredoc ends with three trailing spaces, which I want to keep.
my $string = q:to/END/;
aaa bbb   

       kjkjsdf
kjkdsf
END

grammar Markdown {
    token TOP {  ^ ([ <blank> | <text> ])+ $ }
    token blank { [ \h* <.newline> ]  }
    token text { <indent> <content> }
    token indent { \h* }
    token newline { \n }
    token content { \N*? <trailing>* <.newline> } 
    token trailing { \h+ }
}

my $match = Markdown.parse($string);
$match.say;

OUTPUT

「aaa bbb

       kjkjsdf
kjkdsf
」
 0 => 「aaa bbb
」
  text => 「aaa bbb
」
   indent => 「」
   content => 「aaa bbb
」
    trailing => 「   」
 0 => 「
」
  blank => 「
」
 0 => 「       kjkjsdf
」
  text => 「       kjkjsdf
」
   indent => 「       」
   content => 「kjkjsdf
」
 0 => 「kjkdsf
」
  text => 「kjkdsf
」
   indent => 「」
   content => 「kjkdsf
」

Now, the only problem I'm having is that I'd like the <trailing> capture to be at the same level of the hierarchy as the <indent> and <content> captures.

So I tried this grammar:

grammar Markdown {
    token TOP {  ^ ([ <blank> | <text> ])+ $ }
    token blank { [ \h* <.newline> ]  }
    token text { <indent> <content> <trailing>* <.newline> }
    token indent { \h* }
    token newline { \n }
    token content { \N*?  } 
    token trailing { \h+ }
}

However, it breaks the parsing. So I tried this:

grammar Markdown {
    token TOP {  ^ ([ <blank> | <text> ])+ $ }
    token blank { [ \h* <.newline> ]  }
    token text { <indent> <content>*? <trailing>* <.newline> }
    token indent { \h* }
    token newline { \n }
    token content { \N  }
    token trailing { \h+ }
}

And got:

 0 => 「aaa bbb
」
  text => 「aaa bbb
」
   indent => 「」
   content => 「a」
   content => 「a」
   content => 「a」
   content => 「 」
   content => 「b」
   content => 「b」
   content => 「b」
   trailing => 「   」
 0 => 「
」
  blank => 「
」
 0 => 「       kjkjsdf
」
  text => 「       kjkjsdf
」
   indent => 「       」
   content => 「k」
   content => 「j」
   content => 「k」
   content => 「j」
   content => 「s」
   content => 「d」
   content => 「f」
 0 => 「kjkdsf
」
  text => 「kjkdsf
」
   indent => 「」
   content => 「k」
   content => 「j」
   content => 「k」
   content => 「d」
   content => 「s」
   content => 「f」

This is pretty close to what I want, but it breaks <content> up into individual characters. I could fix that easily after the fact by massaging the $match object, but I'd like to improve my skills with grammars.
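For reference, the after-the-fact workaround mentioned above could look roughly like this. It is only a sketch: it reuses the third grammar unchanged, writes the input as an explicit string instead of the heredoc so the trailing spaces are visible, and @contents is a name made up for this illustration.

```raku
grammar Markdown {
    token TOP { ^ ([ <blank> | <text> ])+ $ }
    token blank { [ \h* <.newline> ] }
    token text { <indent> <content>*? <trailing>* <.newline> }
    token indent { \h* }
    token newline { \n }
    token content { \N }
    token trailing { \h+ }
}

# Same input as above, as an explicit string (three trailing spaces on line 1).
my $string = "aaa bbb   \n\n       kjkjsdf\nkjkdsf\n";
my $match  = Markdown.parse($string) // die 'parse failed';

# @contents is a hypothetical name: one joined content string per text line.
my @contents;
for $match[0].list -> $line {
    next unless $line<text>;                      # skip blank lines
    @contents.push: $line<text><content>.join;    # glue the one-character captures together
}
.say for @contents;
```

This gets the strings back, but the match tree itself still holds one <content> capture per character, which is why a grammar-level fix is preferable.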


Solution
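
Some background on why the second attempt fails, as far as I understand it: a token never backtracks into a subrule once that subrule has matched. A separate token content { \N*? } is frugal, so it succeeds immediately with an empty match, and text cannot ask it for more. The same \N*? written inline can still grow one character at a time. The two grammars below (Split and Joined are made-up names, not part of the question) are a minimal sketch of that difference.

```raku
# Split and Joined are hypothetical names used only for this illustration.
grammar Split {
    token TOP     { <content> \n }
    token content { \N*? }      # frugal: matches the empty string and stops
}

grammar Joined {
    token TOP { \N*? \n }       # same pattern, but inline in a single token
}

my $line = "abc\n";
say Split.parse($line).defined;    # does the split version parse?
say Joined.parse($line).defined;   # does the inline version parse?
```

I'd expect the Split parse to fail and the Joined parse to succeed, which is the behaviour the quick and dirty version below relies on.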

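  • quick and dirty: drop the separate content token and capture \N*? inline as $<content>, so the frugal quantifier can keep extending until <trailing>? <.newline> matches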

    my $string = q:to/END/;
    aaa bbb  
    
           kjkjsdf
    kjkdsf
    END
    
    grammar Markdown {
        token TOP {  ^ ([ <blank> | <text> ])+ $ }
        token blank { [ \h* <.newline> ]  }
        token text { <indent>? $<content>=\N*? <trailing>? <.newline> }
        token indent { \h+ }
        token newline { \n }
        token trailing { \h+ }
    }
    
    my $match = Markdown.parse($string);
    $match.say;
    

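  • lookahead assertions: content consumes one \N at a time, but only while what follows is not trailing whitespace up to the end of the line (trailing is now anchored with $$)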

    my $string = q:to/END/;
    aaa bbb  
    
           kjkjsdf
    kjkdsf
    END
    
    grammar Markdown {
        token TOP {  ^ ([ <blank> | <text> ])+ $ }
        token blank { [ \h* <.newline> ]  }
        token text { <indent>? <content> <trailing>? <.newline> }
        token indent { \h+ }
        token newline { \n }
        token content { [<!before <trailing>> \N]+  }
        token trailing { \h+ $$ }
    }
    
    my $match = Markdown.parse($string);
    $match.say;
    

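  • a little refactoring: use ^^ and $$ as line anchors and make \n a separator in TOP (via %%), so the newline token is no longer needed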

    my $string = q:to/END/;
    aaa bbb  
    
           kjkjsdf
    kjkdsf
    END
    
    grammar Markdown {
        token TOP { ( <blank> | <text> )+ %% \n }
        token blank { ^^ \h* $$  }
        token text { <indent>? <content> <trailing>? }
        token indent { ^^ \h+ }
        token content { [<!before <trailing>> \N]+  }
        token trailing { \h+ $$ }
    }
    
    my $match = Markdown.parse($string);
    $match.say;
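
To check that <indent>, <content> and <trailing> now sit side by side under <text>, the match tree can be walked as in the sketch below. It is not part of the original answer: it repeats the refactored grammar, writes the input as an explicit string instead of the heredoc so the two trailing spaces are visible, and @summary is a name invented for the example.

```raku
grammar Markdown {
    token TOP      { ( <blank> | <text> )+ %% \n }
    token blank    { ^^ \h* $$ }
    token text     { <indent>? <content> <trailing>? }
    token indent   { ^^ \h+ }
    token content  { [ <!before <trailing>> \N ]+ }
    token trailing { \h+ $$ }
}

my $string = "aaa bbb  \n\n       kjkjsdf\nkjkdsf\n";
my $match  = Markdown.parse($string) // die 'parse failed';

# @summary is a hypothetical name: one description per parsed line.
my @summary;
for $match[0].list -> $line {
    with $line<text> -> $text {
        # <indent> and <trailing> are optional, so fall back to an empty string.
        my $indent   = ($text<indent>   // '').Str;
        my $trailing = ($text<trailing> // '').Str;
        @summary.push: sprintf 'indent=%d content=%s trailing=%d',
            $indent.chars, $text<content>.Str, $trailing.chars;
    }
    else {
        @summary.push: 'blank';
    }
}
.say for @summary;
```

All three captures are read directly from $text, with no extra level for <trailing>, and <content> arrives as a single string rather than one capture per character.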