ruby treetop

Roman numerals in treetop grammar


I want to parse an ordered list, which is something like:

I - Something
II - Something else...
IX - Something weird
XIII - etc

So far, my treetop grammar is:

rule text
    roman_numeral separator text newline
end

rule roman_numeral
    &. ('MMM' / 'MM' / 'M')? (('C' [DM]) / 
    ('D'? ('CCC' / 'CC' / 'C')?))? (('X' [LC]) / 
    ('L'? ('XXX' / 'XX' / 'X')?))? (('I' [VX]) / 
    ('V'? ('III' / 'II' / 'I')?))?
end

rule separator
    [\s] "-" [\s]
end

rule text
    (!"\n" .)*
end

rule newline
    ["\n"]
end

However, the corresponding parser is unable to parse the text. What is broken?


Solution

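  • You accidentally defined the rule text twice: once for a whole list item and once for the item's body. Rename the first to line, and then add another rule, lines, that matches zero or more of them.

    The quotes inside the newline character class are also unnecessary: ["\n"] matches either a double quote or a newline, whereas [\n] matches only a newline.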

    Side tip - you can reuse the newline rule in your text rule to keep it DRY.

    grammar Roman
    
      rule lines
        line*
      end
    
      rule line
        roman_numeral separator text newline
      end
    
      rule roman_numeral
        &. ('MMM' / 'MM' / 'M')? (('C' [DM]) /
        ('D'? ('CCC' / 'CC' / 'C')?))? (('X' [LC]) /
        ('L'? ('XXX' / 'XX' / 'X')?))? (('I' [VX]) /
        ('V'? ('III' / 'II' / 'I')?))?
      end
    
      rule separator
        [\s] "-" [\s]
      end
    
      rule text
        (!newline .)*
      end
    
      rule newline
        [\n]
      end
    
    end
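
As a quick sanity check outside Treetop, the same structure can be approximated with plain Ruby regular expressions. This is only a sketch: the names ROMAN, LINE and parse_lines are made up for illustration, and a backtracking Regexp is not the same thing as a PEG, although the two should agree on the sample input from the question.

```ruby
# Regexp approximation of the roman_numeral rule (hypothetical helper,
# not part of Treetop). Each group mirrors one ordered choice in the grammar.
ROMAN = /M{0,3}(?:C[DM]|D?C{0,3})(?:X[LC]|L?X{0,3})(?:I[VX]|V?I{0,3})/

# One list item: numeral, separator, text up to the newline, newline.
# (?=.) mirrors the grammar's &. lookahead.
LINE = /\A(?=.)(#{ROMAN.source})\s-\s([^\n]*)\n/

# Mirrors the lines rule: consume line after line until the input is empty.
# Returns an array of [numeral, text] pairs, or nil if some line does not match.
def parse_lines(input)
  items = []
  rest = input
  until rest.empty?
    m = LINE.match(rest)
    return nil unless m
    items << [m[1], m[2]]
    rest = m.post_match
  end
  items
end
```

Like the line rule, this sketch requires every item, including the last one, to end with a newline; input without a trailing newline is rejected.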
    

    Update

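    You can simplify the grammar a little by removing the negative lookahead in text and the single-character classes in separator. Note that " - " accepts only literal spaces, whereas [\s] also accepted other whitespace such as tabs.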

    rule separator
      " - "
    end
    
    rule text
      [^\n]*
    end
    

    The resulting syntax graph becomes much simpler.
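
To see why the two forms of text are interchangeable, here is a small Ruby sketch that compares the equivalent regular expressions. The constant and method names are illustrative only, and Regexp is used here as a stand-in for the Treetop expressions, not as Treetop itself.

```ruby
# Regexp analogue of (!newline .)* : any character not preceded by a
# newline lookahead hit. The /m flag lets "." match "\n", as "." does in a PEG.
WITH_LOOKAHEAD = /\A((?:(?!\n).)*)/m

# Regexp analogue of [^\n]* : a negated character class.
WITH_NEGATED_CLASS = /\A([^\n]*)/

# Returns the part of the string that the given pattern consumes as "text".
def text_prefix(pattern, string)
  pattern.match(string)[1]
end

samples = ["Something weird\nnext line", "", "\nleading newline"]
samples.each do |s|
  a = text_prefix(WITH_LOOKAHEAD, s)
  b = text_prefix(WITH_NEGATED_CLASS, s)
  puts "#{s.inspect}: #{a.inspect} / #{b.inspect}"
end
```

Both patterns stop at the first newline, so the negated class expresses the same thing without the lookahead.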