How to parse template languages in Ragel?

I've been working on a parser for simple template language. I'm using Ragel.

The requirements are modest. I'm trying to find [[tags]] that can be embedded anywhere in the input string.

I'm trying to parse a simple template language, something that can have tags such as {{foo}} embedded within HTML. I tried several approaches to parse this but had to resort to using a Ragel scanner and use the inefficient approach of only matching a single character as a "catch all". I feel this is the wrong way to go about this. I'm essentially abusing the longest-match bias of the scanner to implement my default rule ( it can only be 1 char long, so it should always be the last resort ).

%%{

  machine parser;

  action start      { tokstart = p; }          
  action on_tag     { results << [:tag, data[tokstart..p]] }            
  action on_static  { results << [:static, data[p..p]] }            

  tag  = ('[[' lower+ ']]') >start @on_tag;

  main := |*
    tag;
    any      => on_static;
  *|;

}%%

( actions written in ruby, but should be easy to understand ).

How would you go about writing a parser for such a simple language? Is Ragel maybe not the right tool? It seems you have to fight Ragel tooth and nails if the syntax is unpredictable such as this.

Solution

Ragel works fine. You just need to be careful about what you're matching. Your question uses both [[tag]] and {{tag}}, but your example uses [[tag]], so I figure that's what you're trying to treat as special.

What you want to do is eat text until you hit an open-bracket. If that bracket is followed by another bracket, then it's time to start eating lowercase characters till you hit a close-bracket. Since the text in the tag cannot include any bracket, you know that the only non-error character that can follow that close-bracket is another close-bracket. At that point, you're back where you started.

Well, that's a verbatim description of this machine:

tag = '[[' lower+ ']]';

main := (
  (any - '[')*  # eat text
  ('[' ^'[' | tag)  # try to eat a tag
)*;

The tricky part is, where do you call your actions? I don't claim to have the best answer to that, but here's what I came up with:

static char *text_start;

%%{
  machine parser;

  action MarkStart { text_start = fpc; }
  action PrintTextNode {
    int text_len = fpc - text_start;
    if (text_len > 0) {
      printf("TEXT(%.*s)\n", text_len, text_start);
    }
  }
  action PrintTagNode {
    int text_len = fpc - text_start - 1;  /* drop closing bracket */
    printf("TAG(%.*s)\n", text_len, text_start);
  }

  tag = '[[' (lower+ >MarkStart) ']]' @PrintTagNode;

  main := (
    (any - '[')* >MarkStart %PrintTextNode
    ('[' ^'[' %PrintTextNode | tag) >MarkStart
  )* @eof(PrintTextNode);
}%%

There are a few non-obvious things:

The eof action is needed because %PrintTextNode is only ever invoked on leaving a machine. If the input ends with normal text, there will be no input to make it leave that state. Because it will also be called when the input ends with a tag, and there is no final, unprinted text node, PrintTextNode tests that it has some text to print.
The %PrintTextNode action nestled in after the ^'[' is needed because, though we marked the start when we hit the [, after we hit a non-[, we'll start trying to parse anything again and remark the start point. We need to flush those two characters before that happens, hence that action invocation.

The full parser follows. I did it in C because that's what I know, but you should be able to turn it into whatever language you need pretty readily:

/* ragel so_tag.rl && gcc so_tag.c -o so_tag */
#include <stdio.h>
#include <string.h>

static char *text_start;

%%{
  machine parser;

  action MarkStart { text_start = fpc; }
  action PrintTextNode {
    int text_len = fpc - text_start;
    if (text_len > 0) {
      printf("TEXT(%.*s)\n", text_len, text_start);
    }
  }
  action PrintTagNode {
    int text_len = fpc - text_start - 1;  /* drop closing bracket */
    printf("TAG(%.*s)\n", text_len, text_start);
  }

  tag = '[[' (lower+ >MarkStart) ']]' @PrintTagNode;

  main := (
    (any - '[')* >MarkStart %PrintTextNode
    ('[' ^'[' %PrintTextNode | tag) >MarkStart
  )* @eof(PrintTextNode);
}%%

%% write data;

int
main(void) {
  char buffer[4096];
  int cs;
  char *p = NULL;
  char *pe = NULL;
  char *eof = NULL;

  %% write init;

  do {
    size_t nread = fread(buffer, 1, sizeof(buffer), stdin);
    p = buffer;
    pe = p + nread;
    if (nread < sizeof(buffer) && feof(stdin)) eof = pe;

    %% write exec;

    if (eof || cs == %%{ write error; }%%) break;
  } while (1);
  return 0;
}

Here's some test input:

[[header]]
<html>
<head><title>title</title></head>
<body>
<h1>[[headertext]]</h1>
<p>I am feeling very [[emotion]].</p>
<p>I like brackets: [ is cool. ] is cool. [] are cool. But [[tag]] is special.</p>
</body>
</html>
[[footer]]

And here's the output from the parser:

TAG(header)
TEXT(
<html>
<head><title>title</title></head>
<body>
<h1>)
TAG(headertext)
TEXT(</h1>
<p>I am feeling very )
TAG(emotion)
TEXT(.</p>
<p>I like brackets: )
TEXT([ )
TEXT(is cool. ] is cool. )
TEXT([])
TEXT( are cool. But )
TAG(tag)
TEXT( is special.</p>
</body>
</html>
)
TAG(footer)
TEXT(
)

The final text node contains only the newline at the end of the file.