I want to write a Raku grammar to check a monthly report about work contribution factors, written in Racket/Scribble.
The report is divided, at the highest level, into monthly sections, and beneath that, into contribution factors. Within that subsection, a repeating set of contribution factors describes what I did for that contribution factor, during that month. I've included pared-down Racket code here.
The contribution factors are named "Quaffing" and "Quenching" as stand-ins for real contribution factors. Although I haven't included them here, there are further subsections (and subsubsections). Within each month, I include a standard set of subsections and subsubsections. As the first part of each subsection and subsubsection name, I include a standard name. As the second part of each subsection and subsubsection name, I tack on the year and month. The year and month is written out like "2024 12(December)". This, of course, changes each month, and keeps the section, subsections, and subsubsections distinct across the whole document.
I want to use a Raku grammar to parse the Racket/Scribble code to ensure it's consistently formatted. I want to ensure that all the sections, subsections, and subsubsections are in place, and that each fits the pattern of a standard subsection or subsubsection name followed by the current year/month.
For each month's section, I need the year/month to change, and I want the grammar to do so automatically.
Here is the Racket/Scribble code:
#lang scribble/manual
@title["\
Contribution Monthly Report\
" #:version "0.001"]
@table-of-contents{}
@section[#:tag "\
Report of 2024 12(December) 31\
"]{Report of 2024 12(December) 31}
@subsection{Contribution Factors Progress, 2024 12(December)}
@subsubsection[#:tag "\
Factor 1: Quaffing, 2024 12(December)\
"]{Factor 1: @italic{Quaffing}, 2024 12(December)}
Random lines of text.
@subsubsection[#:tag "\
Factor 2: Quenching, 2024 12(December)\
"]{Factor 2: @italic{Quenching}, 2024 12(December)}
Random lines of text.
@section[#:tag "\
Report of 2024 11(November) 30\
"]{Report of 2024 11(November) 30}
@subsection{Contribution Factors Progress, 2024 11(November)}
@subsubsection[#:tag "\
Factor 1: Quaffing, 2024 11(November)\
"]{Factor 1: @italic{Quaffing}, 2024 11(November)}
Lines of Text
@subsubsection[#:tag "\
Factor 2: Quenching, 2024 11(November)\
"]{Factor 2: @italic{Quenching}, 2024 11(November)}
Lines of text.
@index-section{}
For context and reference, here is the whole Raku grammar I'm using to parse the above Racket/Scribble code. This code sample shows that I set a dynamic variable, $*tsymm [this section's year month Month], to hold the changing year/month string that will be appended to the subsection and subsubsection name patterns. I've left in small debugging snippets.
Further on in this question, I've also placed just the token where I have the problem:
use v6;
#use Test;
use Grammar::Tracer;
# Hardcoded file name
my $file-name = 'short_obfu_Monthly_Notes.rkt';
# Slurp the file content
my $file-content = try $file-name.IO.slurp;
if $! {
die "Error reading file '$file-name': $!";
}
grammar MonthlyReport {
#my $*tsymm;
token TOP {
:my $*tsymm;
^
<lang-statement>
<title>
<table-of-contents>
<monthly-cycle>+ # this token contains a reference to the token with the problem.
<index>
$
}
token lang-statement {
^^'#' lang \s+ scribble '/' manual \n
}
token title {
\n
'@title["\\' \n
'Contribution Monthly Report\\' \n
'" #:version "0.001"]' \n
#{say '「' ~ $¢ ~ '」';}
}
token table-of-contents {
\n
'@table-of-contents{}' \s*? \n
}
token monthly-cycle {
{say $*tsymm}
<section-wrt-month>
<contribution-factors-progress>
}
token section-wrt-month { # This token is the problem.
\n
'@section[#:tag "' \\ \n
'Report of ' $<this-sections-yyyy-mm-Month>=(\d\d\d\d \s \d\d \([January|February|March|April|May|June|July|August|September|October|November|December]\)) \s [29|30|31] \\ \n
'"]{Report of ' $<this-sections-yyyy-mm-Month> \s \d\d \} \n
{say " trying 1 trying:\n\n $/ \n\n";}
{say " trying 2 trying:\n\n $/{'this-sections-yyyy-mm-Month'} \n\n";}
{$*tsymm = $<this-sections-yyyy-mm-Month>;}
#{$*tsymm.say;}
#{say '「' ~ $¢ ~ '」';}
}
token contribution-factors-progress {
\n
'@subsection{Contribution Factors Progress, ' $*tsymm \} \n
<factor1>
<factor2>
#{say '「' ~ $¢ ~ '」';}
}
token factor1 {
\n
'@subsubsection[#:tag "\\' \n
'Factor 1: Quaffing, ' $*tsymm \\ \n
'"]{Factor 1: @italic{Quaffing}, ' $*tsymm \} \n
.*? <?before \@subsubsection>
#{say 'factor 1 ->「' ~ $¢ ~ '」<- factor 1';}
}
token factor2 {
'@subsubsection[#:tag "\\' \n
'Factor 2: Quenching, ' $*tsymm \\ \n
'"]{Factor 2: @italic{Quenching}, ' $*tsymm \} \n
.*? <?before \@subsubsection>
#{say 'factor 2 ->「' ~ $¢ ~ '」<- factor 2';}
}
token index {
\n
'@index-section{}'
#{say 'index ->「' ~ $¢ ~ '」<- index';}
}
}
# Check the format of the file content
if MonthlyReport.parse($file-content) {
say "The file format is valid.";
} else {
say "The file format is invalid.";
}
The token that's not doing what I want is section-wrt[with regard to]-month. This is the same code as above, just excerpted here to allow focus.
token section-wrt-month {
\n
'@section[#:tag "' \\ \n
'Report of ' $<this-sections-yyyy-mm-Month>=(\d\d\d\d \s \d\d \([January|February|March|April|May|June|July|August|September|October|November|December]\)) \s [29|30|31] \\ \n
'"]{Report of ' $<this-sections-yyyy-mm-Month> \s \d\d \} \n
{say " trying 1 trying:\n\n $/ \n\n";}
{say " trying 2 trying:\n\n $/{'this-sections-yyyy-mm-Month'} \n\n";}
{$*tsymm = $<this-sections-yyyy-mm-Month>;}
#{$*tsymm.say;}
#{say '「' ~ $¢ ~ '」';}
}
I expected the named capture, $<this-sections-yyyy-mm-Month>=(\d\d\d\d \s \d\d \([January|February|March|April|May|June|July|August|September|October|November|December]\)), to be set when it finds the month section (this does work). I want it to be reset on the second pass through section-wrt-month, and I expected it to change to 2024 11(November), but it does not.
I tried changing the token to a rule and to a regex, but neither of those helped.
I tried setting $*tsymm to $0, but that does not work.
I consulted ChatGPT, o1, but it lectured me, incorrectly, about details of the alternations within the regex. When I tried what it (so confidently) lectured me on, it was not true, and was not related to the main problem.
I tried searching this out in the Raku on-line documentation as well as in several Raku/Perl6 books I have. They don't get into enough detail to help with this.
The output contains the ANSI coloring and shows the failure:
[Image: Output]
TL;DR This initial answer provides terse summaries of:
A way to make your grammar work.
What your code is doing wrong.
Why I think you got confused.
I intend to write one or more other answers, and/or later edit this one. The point would be to go into greater depth for the above three topics plus some others. But I wanted to give you something tonight, partly because my plan may not pan out, and partly to provide something in the meantime even if I do end up writing more.
Insert a \n at the start of factor2. This is consistent with all the other tokens you'd written. It's a tidy-up coordinated with the second change:
Add a token end-of-section { $ | <before \n \@ <[a..zA..Z-]>*? 'section'> } to the grammar and replace the <?before \@subsubsection> patterns in the two factor tokens with <end-of-section>.
I'm not saying those are necessarily the changes you really want for your full grammar. I am saying they work for the code you've shared in your question, and will hopefully be illuminating and perhaps a step forward to an appropriate solution.
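The two changes above can be tried out on a cut-down version of the report. This is only a sketch under stated assumptions: the MiniReport grammar, the shortened capture name ym, and the simplified @section / @subsubsection lines are hypothetical stand-ins for your real format, and only the end-of-section token is taken as written above.

```raku
# Hypothetical, simplified two-month report (note the leading blank line).
my $doc = (
    '',
    '@section{Report of 2024 12(December)}',
    '',
    '@subsubsection{Factor 1, 2024 12(December)}',
    'Random lines of text.',
    '',
    '@subsubsection{Factor 2, 2024 12(December)}',
    'Random lines of text.',
    '',
    '@section{Report of 2024 11(November)}',
    '',
    '@subsubsection{Factor 1, 2024 11(November)}',
    'Lines of text.',
    '',
    '@subsubsection{Factor 2, 2024 11(November)}',
    'Lines of text.',
    '',
).join("\n");

grammar MiniReport {
    token TOP {
        :my $*tsymm;                 # year/month string of the current section
        <monthly-cycle>+
        $
    }
    token monthly-cycle { <section-wrt-month> <factor1> <factor2> }
    token section-wrt-month {
        \n
        '@section{Report of ' $<ym>=(\d\d\d\d \s \d\d '(' \w+ ')') '}' \n
        { $*tsymm = ~$<ym> }         # store as a plain string for later literal matching
    }
    token factor1 {
        \n
        '@subsubsection{Factor 1, ' $*tsymm '}' \n
        .*? <end-of-section>
    }
    token factor2 {
        \n                           # first change: leading \n, like the other tokens
        '@subsubsection{Factor 2, ' $*tsymm '}' \n
        .*? <end-of-section>         # second change: stop before ANY kind of section
    }
    token end-of-section { $ | <before \n \@ <[a..zA..Z-]>*? 'section'> }
}

my $m = MiniReport.parse($doc);
if $m {
    say "The file format is valid.";
    # Show which year/month each pass through section-wrt-month captured.
    .say for $m<monthly-cycle>.map({ $_<section-wrt-month><ym>.Str });
}
else {
    say "The file format is invalid.";
}
```

If this behaves as intended, each pass through monthly-cycle should report its own year/month, because each factor2 now stops before the next @section rather than running on to the next @subsubsection.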
The regex .*? <?before \@subsubsection> matches all text from the current parse position forward to just before the next instance of the text @subsubsection.
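A minimal sketch of that behaviour, using a made-up four-line string rather than your report, contrasts the frugal .*? with its greedy counterpart .* in front of the same lookahead:

```raku
# Hypothetical sample text: filler, a marker, more filler, another marker.
my $text = ('alpha', '@subsubsection one', 'beta', '@subsubsection two').join("\n");

# Frugal: consume as little as possible, so stop before the FIRST marker.
my $frugal = $text ~~ / ^ .*? <?before \@subsubsection> /;

# Greedy: consume as much as possible, so stop before the LAST marker.
my $greedy = $text ~~ / ^ .*  <?before \@subsubsection> /;

say $frugal.Str.lines.elems;   # 1 (just "alpha")
say $greedy.Str.lines.elems;   # 3 ("alpha", the first marker line, "beta")
```

In both cases the lookahead itself consumes nothing; it only decides where the preceding quantifier is allowed to stop.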
The first use of this pattern in your factor1 code works as you want. That's because the @subsubsection that the <?before \@subsubsection> matches is the one immediately following the random text you wrote that is still within December.
But the first use of this pattern in your factor2 code does not work as you want:
It starts parsing at the point immediately following where the factor1 token finished matching. That is to say, it starts at the (blank line before the) @subsubsection[#:tag "\ Factor 2: that's still in December. This is still what you want.
It then keeps matching until it reaches the next @subsubsection. But the next one is in the November data!
The upshot is that the first time through monthly-cycle does "successfully" match, but it achieves that "success" erroneously: its factor2 gobbles up the input part way into November's data as it matches!
Thus the second call of section-wrt-month begins its matching at the first (blank line before the) @subsubsection[#:tag "\ Factor 1: Quaffing, 2024 11(November)\. This is of course the wrong place to be parsing. So it fails to match. And then the index token also begins at the same place, which is the wrong place for it too, so it also fails to match, and then the whole parse fails.
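The over-match described above can be reproduced outside the grammar. The following is a sketch on hypothetical, shortened input (the section and subsubsection lines are simplified stand-ins for yours); the regex ends with the same tail as your factor2 token:

```raku
# Hypothetical, trimmed-down input: December's two factors, then November begins.
my $doc = (
    '@subsubsection{Factor 1, December}',
    'December text for factor 1.',
    '@subsubsection{Factor 2, December}',
    'December text for factor 2.',
    '@section{Report of November}',
    '@subsubsection{Factor 1, November}',
    'November text for factor 1.',
).join("\n");

# Same tail as the question's factor2: run to just before the next '@subsubsection'.
my $factor2 = $doc ~~ / '@subsubsection{Factor 2, December}' \n .*? <?before \@subsubsection> /;

# Print the whole capture, untruncated, to see where it really ended.
say $factor2.Str;
```

The printed capture should include the @section{Report of November} line, which is exactly the text a following section-wrt-month would have needed to see.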
I imagine there are likely many factors leading to your confusion including:
Weaknesses of the debugging tools you're using? Grammar::Tracer was a wonderful new tool when it was first introduced (in 2011). But there are other options, and it looks like this old tool led you astray. (I imagine you were tricked by the green lights on the first two factor token matches. The match capture string it displays is truncated, so you can't see that while factor1 captured as you wanted, factor2 captured too much.)
Lack of familiarity with Raku? It looks like you know a lot. Dynamic variables!?! $¢!?! But it's hard to know if that's just ChatGPT throwing random guesses at you.
Lack of familiarity with regexing and/or being thrown by thinking the problem was something to do with use of Raku? Again, it's plausible you know regexing well, or ChatGPT thinks it does, but a fundamental problem here was not realizing what .*? was doing. The regex atom .*? is not Raku specific but is instead found in pretty much all regex languages. Similarly, syntax aside, <?before foo> is just a lookahead predicate which has the same semantics in Raku as it does in the many (most) other regex languages/libraries/engines which have the same feature.
As I said at the start, I hope to later provide guidance so that you have much more fun and/or are much more productive than I imagine you managed with this work/exercise so far. Or perhaps others will pitch in with comments or answers.