Search code examples
parsingregular-languageragel

What is the best way to capture ambiguous segments of text?


What would be the best way to capture the inner text in the following case?

inner_text = any*;
tag_cdata = '<![CDATA[' inner_text >cdata_start %cdata_end ']]>';

The problem is, it seems like the cdata_end action fires several times due to the fact that inner_text could match ].


Solution

  • I found the solution. You need to handle non-determinism. It wasn't clear initially, but the correct solution is something like this:

    inner_text = any*;
    tag_cdata = '<![CDATA[' inner_text >text_begin %text_end ']]>' %cdata_end;
    
    action text_begin {
        text_begin_at = p;
    }
    
    action text_end {
        text_end_at = p;
    }
    
    action cdata_end {
        delegate.cdata(data.byteslice(text_begin_at, text_end_at-text_begin_at))
    }
    

    Essentially, you wait until you are sure you parsed a complete CDATA tag before firing the callback, using information you previously captured.

    In addition, I found that some forms of non-determinism in Ragel need to be explicitly handled using priorities. While this seems a bit ugly, it is the only solution in some cases.

    When dealing with a pattern such as (a+ >a_begin %a_end | b)* you will find that the events are called for every single a encountered, rather than at the longest sub-sequence. This ambiguity, in some cases, can be solved using the longest match kleene star **. What this does is it prefers to match the existing pattern rather than wrapping around.

    What was surprising to me, is that this actually modifies the way events are called, too. As an example, this produces a machine which is unable to buffer more than one character at a time when invoking callbacks:

    %%{
      machine example;
    
      action a_begin {}
      action a_end {}
    
      main := ('a'+ >a_begin %a_end | 'b')*;
    }%%
    

    Produces:

    Non-greedy Parser

    You'll notice that it calls a_begin and a_end every time.

    In contrast, we can make the inner loop and event handling greedy:

    %%{
      machine example;
    
      action a_begin {}
      action a_end {}
    
      main := ('a'+ >a_begin %a_end | 'b')**;
    }%%
    

    which produces:

    Greedy Parser