How to parse template languages in Ragel?


I've been working on a parser for a simple template language. I'm using Ragel.

The requirements are modest. I'm trying to find [[tags]] that can be embedded anywhere in the input string.

I'm trying to parse a simple template language, something that can have tags such as {{foo}} embedded within HTML. I tried several approaches, but had to resort to a Ragel scanner with the inefficient approach of matching a single character as a "catch-all". I feel this is the wrong way to go about it. I'm essentially abusing the longest-match bias of the scanner to implement my default rule (it can only be one character long, so it should always be the last resort).

%%{

  machine parser;

  action start      { tokstart = p; }          
  action on_tag     { results << [:tag, data[tokstart..p]] }            
  action on_static  { results << [:static, data[p..p]] }            

  tag  = ('[[' lower+ ']]') >start @on_tag;

  main := |*
    tag;
    any      => on_static;
  *|;

}%%

(The actions are written in Ruby, but should be easy to understand.)

How would you go about writing a parser for such a simple language? Is Ragel maybe not the right tool? It seems you have to fight Ragel tooth and nail when the syntax is as unpredictable as this.


Solution

  • Ragel works fine. You just need to be careful about what you're matching. Your question uses both [[tag]] and {{tag}}, but your example uses [[tag]], so I figure that's what you're trying to treat as special.

    What you want to do is eat text until you hit an open-bracket. If that bracket is followed by another bracket, then it's time to start eating lowercase characters till you hit a close-bracket. Since the text in the tag cannot include any bracket, you know that the only non-error character that can follow that close-bracket is another close-bracket. At that point, you're back where you started.

    Well, that's a verbatim description of this machine:

    tag = '[[' lower+ ']]';
    
    main := (
      (any - '[')*  # eat text
      ('[' ^'[' | tag)  # try to eat a tag
    )*;
    

    The tricky part is, where do you call your actions? I don't claim to have the best answer to that, but here's what I came up with:

    static char *text_start;
    
    %%{
      machine parser;
    
      action MarkStart { text_start = fpc; }
      action PrintTextNode {
        int text_len = fpc - text_start;
        if (text_len > 0) {
          printf("TEXT(%.*s)\n", text_len, text_start);
        }
      }
      action PrintTagNode {
        int text_len = fpc - text_start - 1;  /* drop closing bracket */
        printf("TAG(%.*s)\n", text_len, text_start);
      }
    
      tag = '[[' (lower+ >MarkStart) ']]' @PrintTagNode;
    
      main := (
        (any - '[')* >MarkStart %PrintTextNode
        ('[' ^'[' %PrintTextNode | tag) >MarkStart
      )* @eof(PrintTextNode);
    }%%
    

    There are a few non-obvious things:

    • The eof action is needed because %PrintTextNode is only ever invoked on leaving a machine. If the input ends with normal text, there will be no input to make it leave that state. Because it will also be called when the input ends with a tag, and there is no final, unprinted text node, PrintTextNode tests that it has some text to print.
    • The %PrintTextNode action nestled in after the ^'[' is needed because, although we marked the start when we hit the [, once we hit a non-[ the machine goes back to matching arbitrary text and re-marks the start point. We need to flush those two characters before that happens, hence that action invocation.

    The full parser follows. I did it in C because that's what I know, but you should be able to turn it into whatever language you need pretty readily:

    /* ragel so_tag.rl && gcc so_tag.c -o so_tag */
    #include <stdio.h>
    #include <string.h>
    
    static char *text_start;
    
    %%{
      machine parser;
    
      action MarkStart { text_start = fpc; }
      action PrintTextNode {
        int text_len = fpc - text_start;
        if (text_len > 0) {
          printf("TEXT(%.*s)\n", text_len, text_start);
        }
      }
      action PrintTagNode {
        int text_len = fpc - text_start - 1;  /* drop closing bracket */
        printf("TAG(%.*s)\n", text_len, text_start);
      }
    
      tag = '[[' (lower+ >MarkStart) ']]' @PrintTagNode;
    
      main := (
        (any - '[')* >MarkStart %PrintTextNode
        ('[' ^'[' %PrintTextNode | tag) >MarkStart
      )* @eof(PrintTextNode);
    }%%
    
    %% write data;
    
    int
    main(void) {
      char buffer[4096];
      int cs;
      char *p = NULL;
      char *pe = NULL;
      char *eof = NULL;
    
      %% write init;
    
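      /* Caveat: text_start points into buffer, so a node that straddles two
         reads is not printed correctly; inputs shorter than the buffer are fine. */
      do {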
        size_t nread = fread(buffer, 1, sizeof(buffer), stdin);
        p = buffer;
        pe = p + nread;
        if (nread < sizeof(buffer) && feof(stdin)) eof = pe;
    
        %% write exec;
    
        if (eof || cs == %%{ write error; }%%) break;
      } while (1);
      return 0;
    }
    

    Here's some test input:

    [[header]]
    <html>
    <head><title>title</title></head>
    <body>
    <h1>[[headertext]]</h1>
    <p>I am feeling very [[emotion]].</p>
    <p>I like brackets: [ is cool. ] is cool. [] are cool. But [[tag]] is special.</p>
    </body>
    </html>
    [[footer]]
    

    And here's the output from the parser:

    TAG(header)
    TEXT(
    <html>
    <head><title>title</title></head>
    <body>
    <h1>)
    TAG(headertext)
    TEXT(</h1>
    <p>I am feeling very )
    TAG(emotion)
    TEXT(.</p>
    <p>I like brackets: )
    TEXT([ )
    TEXT(is cool. ] is cool. )
    TEXT([])
    TEXT( are cool. But )
    TAG(tag)
    TEXT( is special.</p>
    </body>
    </html>
    )
    TAG(footer)
    TEXT(
    )
    

    The final text node contains only the newline at the end of the file.
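
    For comparison, the prose description above ("eat text until you hit an open-bracket...") can also be hand-coded without Ragel. The following is only a minimal sketch under a few assumptions: the names scan_template and emit are hypothetical and not part of the original parser, the whole input is already in memory, nodes are appended to a caller-supplied buffer instead of being printed, and a malformed tag makes the function return -1 (roughly where the Ragel machine would enter its error state).

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: append "KIND(text)\n" to out, skipping empty nodes. */
static void emit(char *out, size_t cap, const char *kind,
                 const char *s, size_t len) {
    size_t used = strlen(out);
    if (len == 0 || used + 1 >= cap) return;
    snprintf(out + used, cap - used, "%s(%.*s)\n", kind, (int)len, s);
}

/* Hand-coded counterpart of ((any - '[')* ('[' ^'[' | tag))*.
 * Returns 0 on success, -1 on a malformed tag. */
int scan_template(const char *s, size_t n, char *out, size_t cap) {
    size_t i = 0;
    assert(s != NULL && out != NULL && cap > 0);
    out[0] = '\0';
    while (i < n) {
        size_t start = i;
        while (i < n && s[i] != '[') i++;            /* (any - '[')* : eat text */
        emit(out, cap, "TEXT", s + start, i - start);
        if (i == n) break;                           /* input ended in plain text */

        if (i + 1 == n || s[i + 1] != '[') {         /* '[' ^'[' : not a tag */
            size_t len = (i + 1 == n) ? 1 : 2;       /* lone '[' at end of input */
            emit(out, cap, "TEXT", s + i, len);
            i += len;
            continue;
        }

        start = i + 2;                               /* skip "[[" */
        i = start;
        while (i < n && islower((unsigned char)s[i])) i++;   /* lower+ */
        if (i == start || i + 1 >= n || s[i] != ']' || s[i + 1] != ']')
            return -1;                               /* malformed tag */
        emit(out, cap, "TAG", s + start, i - start);
        i += 2;                                      /* skip "]]" */
    }
    return 0;
}
```

    Like the Ragel version, this emits a bracket that does not start a tag, together with the character after it, as its own small text node. Whether the hand-written loop or the generated machine is preferable is a judgment call; the Ragel grammar is shorter and easier to extend, while the loop has no build step.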