Lexing/Parsing "here" documents

For those that are experts in lexing and parsing... I am attempting to write a series of programs in perl that would parse out IBM mainframe z/OS JCL for a variety of purposes, but am hitting a roadblock in methodology. I am mostly following the lexing/parsing ideology put forth in "Higher Order Perl" by Mark Jason Dominus, but there are some things that I can't quite figure out how to do.

JCL has what's called inline data, which is very similar to "here" documents. I am not quite sure how to lex these into tokens.

The layout for inline data is as follows:

//DDNAME   DD *
this is the inline data
this is some more inline data
/*
...

Conventionally, the "*" after the "DD" signifies that following lines are the inline data itself, terminated by either "/*" or the next valid JCL record (starting with "//" in the first 2 columns).

More advanced, the inline data could appear as such:

//DDNAME   DD *,DLM=ZZ
//THIS LOOKS LIKE JCL BUT IT'S ACTUALLY DATA
//MORE DATA MASQUERADING AS JCL
ZZ
...

Sometimes the inline data is itself JCL (perhaps to be pumped to a program or the internal reader, whatever).

But here's the rub. In JCL, the records are 80 bytes, fixed in length. Everything past column 72 (cols 73-80) is a "comment". As well, everything following a blank that follows valid JCL is likewise a comment. Since I am looking to manipulate JCL in my programs and spit it back out, I'd like to capture comments so that I can preserve them.

So, here's an example of inline comments in the case of inline data:

//DDNAME   DD *,DLM=ZZ THIS IS A COMMENT                                COL73DAT
data
...
ZZ
...more JCL

I originally thought that I could have my top-most lexer pull in a line of JCL and immediately create a non-token for cols 1-72 and then a token (['COL73COMMENT',$1]) for the column 73 comment, if any. This would then pass downstream to the next iterator/tokenizer a string of the cols 1-72 text followed by the col73 token.

But how would I, downstream from there, grab the inline data? I'd originally figured that the top-most tokenizer could look for a "DD \*(,DLM=(\S*))" (or the like) and then just keep pulling records from the feeding iterator until it hit the delimiter or a valid JCL starter ("//").

But you may see the issue here... I can't have 2 topmost tokenizers... either the tokenizer that looks for COL73 comments must be the top or the tokenizer that gets inline data must be at the top.

I imagine that perl parsers have the same challenge, since seeing

<<DELIM

isn't necessarily the end of the line, followed by the here document data. After all, you could see perl like:

my $this=$obj->ingest(<<DELIM)->reformat();
inline here document data
more data
DELIM

How would the tokenizer/parser know to tokenize the ")->reformat();" and then still grab the following records as-is? In the case of the inline JCL data, those lines are passed as-is, cols 73-80 are NOT comments in that case...

So, any takers on this? I know there will be tons of questions clarifying my needs and I'm happy to clarify as much as is needed.

Thanks in advance for any help...

Solution

In this answer I will concentrate on heredocs, because the lessons can be easily transferred to the JCL.

Any language that supports heredocs is not context-free, and thus cannot be parsed with common techniques like recursive descent. We need a way to guide the lexer along more twisted paths, but in doing so, we can maintain the appearance of a context-free language. All we need is another stack.

For the parser, we treat introductions to heredocs <<END as string literals. But the lexer has to be extended to do the following:

When a heredoc introduction is encountered, it adds the terminator to the stack.
When a newline is encountered, the body of the heredoc is lexed, until the stack is empty. After that, normal parsing is resumed.

Take care to update the line number appropriately.

In a hand-written combined parser/lexer, this could be implemented like so:

use strict; use warnings; use 5.010;

my $s = <<'INPUT-END'; pos($s) = 0;
<<A <<B
body 1
A
body 2
B
<<C
body 3
C
INPUT-END

my @strs;
push @strs, parse_line() while pos($s) < length($s);
for my $i (0 .. $#strs) {
  say "STRING $i:";
  say $strs[$i];
}

sub parse_line {
  my @strings;
  my @heredocs;

  $s =~ /\G\s+/gc;

  # get the markers
  while ($s =~ /\G<<(\w+)/gc) {
    push @strings, '';
    push @heredocs, [ \$strings[-1], $1 ];
    $s =~ /\G[^\S\n]+/gc;  # spaces that are no newlines
  }

  # lex the EOL
  $s =~ /\G\n/gc or die "Newline expected";

  # process the deferred heredocs:
  while (my $heredoc = shift @heredocs) {
    my ($placeholder, $marker) = @$heredoc;
    $s =~ /\G(.*\n)$marker\n/sgc or die "Heredoc <<$marker expected";
    $$placeholder = $1;
  }

  return @strings;
}

Output:

STRING 0:
body 1

STRING 1:
body 2

STRING 2:
body 3

The Marpa parser simplifies this a bit by allowing events to be triggered once a certain token is parsed. These are called pauses, because the built-in lexing pauses a moment for you to take over. Here is a high-level overview and a short blogpost describing this technique with the demo code on Github.