perl parsing jcl

Lexing/Parsing "here" documents


For those that are experts in lexing and parsing... I am attempting to write a series of programs in perl that would parse out IBM mainframe z/OS JCL for a variety of purposes, but am hitting a roadblock in methodology. I am mostly following the lexing/parsing ideology put forth in "Higher Order Perl" by Mark Jason Dominus, but there are some things that I can't quite figure out how to do.

JCL has what's called inline data, which is very similar to "here" documents. I am not quite sure how to lex these into tokens.

The layout for inline data is as follows:

//DDNAME   DD *
this is the inline data
this is some more inline data
/*
...

Conventionally, the "*" after the "DD" signifies that following lines are the inline data itself, terminated by either "/*" or the next valid JCL record (starting with "//" in the first 2 columns).

In a more advanced form, the inline data can name its own delimiter:

//DDNAME   DD *,DLM=ZZ
//THIS LOOKS LIKE JCL BUT IT'S ACTUALLY DATA
//MORE DATA MASQUERADING AS JCL
ZZ
...

Sometimes the inline data is itself JCL (perhaps to be pumped to a program or the internal reader, whatever).

But here's the rub. In JCL, the records are 80 bytes, fixed in length. Everything past column 72 (cols 73-80) is a "comment". As well, everything following a blank that follows valid JCL is likewise a comment. Since I am looking to manipulate JCL in my programs and spit it back out, I'd like to capture comments so that I can preserve them.

So, here's an example of inline comments in the case of inline data:

//DDNAME   DD *,DLM=ZZ THIS IS A COMMENT                                COL73DAT
data
...
ZZ
...more JCL

I originally thought that I could have my top-most lexer pull in a line of JCL and immediately create a non-token for cols 1-72 and then a token (['COL73COMMENT',$1]) for the column 73 comment, if any. This would then pass downstream to the next iterator/tokenizer a string of the cols 1-72 text followed by the col73 token.

But how would I, downstream from there, grab the inline data? I'd originally figured that the top-most tokenizer could look for a "DD \*(,DLM=(\S*))" (or the like) and then just keep pulling records from the feeding iterator until it hit the delimiter or a valid JCL starter ("//").

But you may see the issue here... I can't have 2 topmost tokenizers... either the tokenizer that looks for COL73 comments must be the top or the tokenizer that gets inline data must be at the top.

I imagine that perl parsers have the same challenge, since seeing

<<DELIM

isn't necessarily the end of the line, followed by the here document data. After all, you could see perl like:

my $this=$obj->ingest(<<DELIM)->reformat();
inline here document data
more data
DELIM

How would the tokenizer/parser know to tokenize the ")->reformat();" and then still grab the following records as-is? In the case of the inline JCL data, those lines are passed as-is, cols 73-80 are NOT comments in that case...

So, any takers on this? I know there will be tons of questions clarifying my needs and I'm happy to clarify as much as is needed.

Thanks in advance for any help...


Solution

  • In this answer I will concentrate on heredocs, because the lessons can be easily transferred to the JCL.

    Heredocs with arbitrary terminators make a language non-context-free, so a grammar alone cannot describe them, and the conventional setup where the lexer knows nothing about context breaks down. We need a way to guide the lexer along more twisted paths, but in doing so, we can maintain the appearance of a context-free language for the parser. All we need is one more data structure: a queue of pending heredocs.

    For the parser, we treat heredoc introductions such as <<END as string literals. But the lexer has to be extended to do the following:

    • When a heredoc introduction is encountered, it appends the terminator to the queue.
    • When a newline is encountered, the heredoc bodies are lexed in the order they were introduced, until the queue is empty. After that, normal lexing is resumed.

    Take care to update the line number appropriately.

    In a hand-written combined parser/lexer, this could be implemented like so:

    use strict; use warnings; use 5.010;
    
    my $s = <<'INPUT-END'; pos($s) = 0;
    <<A <<B
    body 1
    A
    body 2
    B
    <<C
    body 3
    C
    INPUT-END
    
    my @strs;
    push @strs, parse_line() while pos($s) < length($s);
    for my $i (0 .. $#strs) {
      say "STRING $i:";
      say $strs[$i];
    }
    
    sub parse_line {
      my @strings;
      my @heredocs;
    
      $s =~ /\G\s+/gc;
    
      # get the markers
      while ($s =~ /\G<<(\w+)/gc) {
        push @strings, '';
        push @heredocs, [ \$strings[-1], $1 ];
        $s =~ /\G[^\S\n]+/gc;  # skip horizontal whitespace, but not the newline
      }
    
      # lex the EOL
      $s =~ /\G\n/gc or die "Newline expected";
    
      # process the deferred heredocs:
      while (my $heredoc = shift @heredocs) {
        my ($placeholder, $marker) = @$heredoc;
        # take whole lines lazily, so we stop at the FIRST line that is exactly the marker
        $s =~ /\G((?:.*\n)*?)\Q$marker\E\n/gc or die "Heredoc <<$marker expected";
        $$placeholder = $1;
      }
    
      return @strings;
    }
    

    Output:

    STRING 0:
    body 1
    
    STRING 1:
    body 2
    
    STRING 2:
    body 3
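
    The demo above only accepts lines that consist of heredoc markers. A minimal sketch of how the same deferral could work when markers are mixed with other tokens, as in the question's "ingest(<<DELIM)->reformat();" line, might look like this. The sub name tokenize, the token types HEREDOC and OTHER, and the crude catch-all token regex are made up for illustration; this is not how the real Perl lexer is written.

```perl
use strict;
use warnings;

# Hypothetical tokenizer: returns a list of [TYPE, value] tokens.
sub tokenize {
    my ($src) = @_;
    my @tokens;
    my @pending;    # queue of [token, terminator] whose bodies are not read yet
    pos($src) = 0;

    while (pos($src) < length($src)) {
        if ($src =~ /\G<<(\w+)/gc) {
            # Emit the token now, fill in its body once the line has ended.
            my $tok = [ HEREDOC => '' ];
            push @tokens,  $tok;
            push @pending, [ $tok, $1 ];
        }
        elsif ($src =~ /\G\n/gc) {
            # End of line: read the deferred bodies in introduction order.
            while (my $p = shift @pending) {
                my ($tok, $marker) = @$p;
                $src =~ /\G((?:.*\n)*?)\Q$marker\E\n/gc
                    or die "Terminator for <<$marker not found";
                $tok->[1] = $1;
            }
        }
        elsif ($src =~ /\G[^\S\n]+/gc) {
            # horizontal whitespace: skipped
        }
        elsif ($src =~ /\G(\$?\w+|->|[^\s\w])/gc) {
            # Stand-in for the real token rules of the language.
            push @tokens, [ OTHER => $1 ];
        }
        else {
            die "Cannot lex at offset " . pos($src);
        }
    }
    die "Unterminated heredoc" if @pending;
    return @tokens;
}

my $input = join "\n",
    'my $this=$obj->ingest(<<DELIM)->reformat();',
    'inline here document data',
    'more data',
    'DELIM',
    '';

for my $tok (tokenize($input)) {
    my ($type, $value) = @$tok;
    $value =~ s/\n/\\n/g;    # make newlines visible in the dump
    print "$type: $value\n";
}
```

    The HEREDOC token sits in the stream at the position of the marker, so the parser sees it between "(" and ")" like any other string literal, while its body is only filled in after the rest of the line has been tokenized.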
    

    The Marpa parser simplifies this a bit by allowing events to be triggered once a certain token is lexed. These are called pauses, because the built-in lexing pauses for a moment to let you take over. Here is a high-level overview and a short blog post describing this technique, with the demo code on GitHub.
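
    Transferred to JCL, the same idea might look roughly like the sketch below. JCL is record-based, so here a single top-level lexer works record by record: it splits off the column 73-80 comment only for real JCL statements, and when it sees "DD *" it switches to data mode and pulls the following records through untouched. It follows the rules as the question states them (without DLM= the data ends at "/*" or at the next "//" record; with DLM= only the delimiter ends it). The sub name lex_jcl and the token types JCL, INLINE_DATA and DATA_END are assumptions (COL73COMMENT is from the question); continuation lines, quoted delimiters and other DD forms are ignored.

```perl
use strict;
use warnings;

# Hypothetical lexer: turns a list of JCL records into [TYPE, value] tokens.
sub lex_jcl {
    my @records = @_;
    my @tokens;

    while (@records) {
        my $rec = shift @records;

        # A JCL statement: cols 1-72 are statement text, cols 73-80 a comment.
        my $stmt  = substr($rec, 0, 72);
        my $col73 = length($rec) > 72 ? substr($rec, 72, 8) : '';
        $stmt =~ s/\s+\z//;
        push @tokens, [ JCL => $stmt ];
        push @tokens, [ COL73COMMENT => $col73 ] if $col73 =~ /\S/;

        # Only "DD *" (optionally with DLM=xx) switches to data mode.
        next unless $stmt =~ m{\A//\S*\s+DD\s+\*(?:,DLM=([^\s,]+))?(?=[\s,]|\z)};
        my $dlm = $1;                          # undef when no DLM= was given
        my $end = defined $dlm ? $dlm : '/*';

        my @data;
        my $terminator;
        while (@records) {
            # Without DLM=, the next "//" record ends the data and stays in
            # the queue so that it is lexed as a normal JCL statement.
            last if !defined $dlm && $records[0] =~ m{\A//};
            my $line = shift @records;
            if (index($line, $end) == 0) {     # terminator starts in column 1
                $terminator = $line;
                last;
            }
            push @data, $line;                 # as-is: cols 73-80 are data here
        }
        push @tokens, [ INLINE_DATA => \@data ];
        push @tokens, [ DATA_END => $terminator ] if defined $terminator;
    }
    return @tokens;
}

my @records = (
    sprintf('%-72s%s', '//DDNAME   DD *,DLM=ZZ THIS IS A COMMENT', 'COL73DAT'),
    q{//THIS LOOKS LIKE JCL BUT IT'S ACTUALLY DATA},
    'ZZ',
    '//STEP2    EXEC PGM=IEFBR14',
);

for my $tok (lex_jcl(@records)) {
    my ($type, $value) = @$tok;
    $value = join ' | ', @$value if ref $value;
    print "$type: $value\n";
}
```

    The blank-delimited trailing comment ("THIS IS A COMMENT") is left inside the JCL token here; a downstream tokenizer can split it off, since by then the inline data has already been taken out of its way.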