Tags: python, parsing, lexer, textile, lark-parser

Parsing elements/fields without end marker, problems with non-greedy regex, usage of a custom lexer


I want to be able to parse files in the Textile markup language (https://textile-lang.com/) in order to convert them to LaTeX. The file I have is a slight extension of Textile, since it adds fields and footnotes. An example file is given below.

test.textile

#[contents]#
p. This is a paragraph.

With some bullet points:

* Bullet 1
* Bullet 2
* Bullet 3

And a code block:

bc.. # Program to display the Fibonacci sequence up to n-th term

"""\
* Program to display the Fibonacci sequence up to n-th term
"""

search_string = r"<5>"

nterms = int(input("How many terms? "))

# first two terms
n1, n2 = 0, 1
count = 0

p. And after the block of code another paragraph, with a footnote<1>.

bc. fn1. This is the footnote contents

p. And here is another paragraph

#[second_field]#
Some more contents

To parse the file I have the following parser.

parser.py

from lark import Lark

def read_file(filename):
    with open(filename) as f:
        return f.read()

grammar = read_file('grammar.lark')
parser = Lark(grammar, start='elements')
textile = read_file('test.textile')
tree = parser.parse(textile)
print(tree)

And the following grammar. The grammar does not handle bullet points and footnotes yet, because I already run into another problem without them.

grammar.lark

elements: element+
?element: field content
?field: FIELD_START field_name FIELD_END
field_name: FIELD_NAME
content: contents*
?contents: paragraph
    | code_block
code_block: CODE_BLOCK_START STR
paragraph: PARAGRAPH_START? STR

FIELD_START: /(\A|[\r\n]{2,})#\[/
FIELD_NAME: /[^\]]+/
FIELD_END: /\]#[\r\n]/
CODE_BLOCK_START: /bc\.\.? /
PARAGRAPH_START: /p\. /
STR: /.+/s

When I run the parser, I get the following output.

output

Tree(Token('RULE', 'elements'), [
    Tree(Token('RULE', 'element'), [
        Tree(Token('RULE', 'field'), [
            Token('FIELD_START', '#['), 
            Tree(Token('RULE', 'field_name'), [
                Token('FIELD_NAME', 'contents')]), 
            Token('FIELD_END', ']#\n')]), 
        Tree(Token('RULE', 'content'), [
            Tree(Token('RULE', 'paragraph'), [
                Token('PARAGRAPH_START', 'p. '), 
                Token('STR', 'This is a paragraph.\n\nWith some bullet points:\n\n* Bullet 1\n* Bullet 2\n* Bullet 3\n\nAnd a code block:\n\nbc.. # Program to display the Fibonacci sequence up to n-th term\n\n"""\\\n* Program to display the Fibonacci sequence up to n-th term\n"""\n\nsearch_string = r"<5>"\n\nnterms = int(input("How many terms? "))\n\n# first two terms\nn1, n2 = 0, 1\ncount = 0\n\np. And after the block of code another paragraph, with a footnote<1>.\n\nbc. fn1. This is the footnote contents\n\np. And here is another paragraph\n\n#[second_field]#\nSome more contents\n')])])])])

The whole rest of the file is parsed as a single paragraph, which is sort of correct, since /.+/s can match anything. So I change the definition of STR to /.+?/s to make it non-greedy, but now the output is as follows (pretty-printed):

output

elements
  element
    field
      #[
      field_name        contents
      ]#

    content
      paragraph
        p.
        T
      paragraph h
      paragraph i
      paragraph s
      paragraph
      paragraph i
      paragraph s
      paragraph
      paragraph a
      paragraph
--snip--
      paragraph

      paragraph #
      paragraph [
      paragraph s
      paragraph e
      paragraph c
      paragraph o
      paragraph n
      paragraph d
      paragraph _
      paragraph f
      paragraph i
      paragraph e
      paragraph l
      paragraph d
      paragraph ]
      paragraph #

Now every single character is parsed as its own paragraph, and the whole rest of the file is still consumed as paragraph content.
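To see why both variants misbehave, it helps to try the two regexes on their own, outside of Lark. A minimal sketch using Python's re module, with a made-up input:

import re

text = "p. First paragraph.\n\np. Second paragraph.\n"

# Greedy: with DOTALL, .+ consumes everything up to the end of the
# input, so the parser never gets a chance to match a second
# PARAGRAPH_START.
print(re.match(r".+", text, re.DOTALL).group())   # the whole input

# Non-greedy: .+? matches as little as possible, i.e. a single
# character, so the parser ends up building one paragraph per character.
print(re.match(r".+?", text, re.DOTALL).group())  # just "p"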

My first attempt at a solution was to create a custom lexer that emits tokens for FIELD_START, FIELD_END, PARAGRAPH_START, CODE_BLOCK_START and the footnote-related token types.

My lexer looks as follows:

from lark import Lark
from lark.lexer import Lexer, Token
import re

class MyLexer(Lexer):
    def __init__(self, *args, **kwargs):
        # Lark passes lexer configuration here; this lexer needs none of it.
        pass

    def lex(self, data):
        tokens = {
            "FIELD_START": r"(?:\A|[\r\n]{2,})#\[",
            "FIELD_END": r"\]#[\r\n]",
            "FOOTNOTE_ANCHOR": r"<\d>",
            "FOOTNOTE_START": r"bc\. fn\d\. ",
            "PARAGRAPH_START": r"p\. ",
            "CODE_BLOCK_START": r"bc\.\.? ",
        }
        # Each pattern gets its own capture group, so re.split returns the
        # matched delimiters along with the text between them.
        regex = '|'.join(f"({r})" for r in tokens.values())
        for x in re.split(regex, data):
            if not x:
                continue  # skip empty strings and None (non-matching groups)
            for token_type, token_regex in tokens.items():
                if re.match(token_regex, x):
                    yield Token(token_type, x)
                    break
            else:
                yield Token("STR", x)

parser = Lark(grammar, lexer=MyLexer, start='elements')

It builds a combined regex from the given token patterns, splits the whole input on it, and yields everything as a token: either one of the defined types, or "STR".
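As a quick sanity check, the lexer can be run on a short snippet in isolation (the input string here is made up for illustration):

for token in MyLexer().lex("p. Some text with a footnote<1>.\n"):
    print(token.type, repr(token.value))

# PARAGRAPH_START 'p. '
# STR 'Some text with a footnote'
# FOOTNOTE_ANCHOR '<1>'
# STR '.\n'

The new grammar looks as follows: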

elements: element+
?element: field content
?field: FIELD_START field_name FIELD_END
field_name: STR
content: contents*
?contents: STR 
    | paragraph 
    | code_block
code_block: CODE_BLOCK_START STR
paragraph: PARAGRAPH_START? paragraph_contents+
?paragraph_contents: STR 
    | FOOTNOTE_ANCHOR
    | footnote
footnote: FOOTNOTE_START STR

%declare FIELD_START FIELD_END FOOTNOTE_ANCHOR FOOTNOTE_START PARAGRAPH_START CODE_BLOCK_START STR

The output of the parser is as follows:

Tree(Token('RULE', 'elements'), [
    Tree(Token('RULE', 'element'), [
        Tree(Token('RULE', 'field'), [
            Token('FIELD_START', '#['), 
            Tree(Token('RULE', 'field_name'), [
                Token('STR', 'contents')]), 
            Token('FIELD_END', ']#\n')]), 
        Tree(Token('RULE', 'content'), [
            Tree(Token('RULE', 'paragraph'), [
                Token('PARAGRAPH_START', 'p. '), 
                Token('STR', 'This is a paragraph.\n\nWith some bullet points:\n\n* Bullet 1\n* Bullet 2\n* Bullet 3\n\nAnd a code block:\n\n')]), 
            Tree(Token('RULE', 'code_block'), [
                Token('CODE_BLOCK_START', 'bc.. '), 
                Token('STR', '# Program to display the Fibonacci sequence up to n-th term\n\n"""\\\n* Program to display the Fibonacci sequence up to n-th term\n"""\n\nsearch_string = r"')]), 
            Tree(Token('RULE', 'paragraph'), [
                Token('FOOTNOTE_ANCHOR', '<5>'), 
                Token('STR', '"\n\nnterms = int(input("How many terms? "))\n\n# first two terms\nn1, n2 = 0, 1\ncount = 0\n\n')]), 
            Tree(Token('RULE', 'paragraph'), [
                Token('PARAGRAPH_START', 'p. '), 
                Token('STR', 'And after the block of code another paragraph, with a footnote'), 
                Token('FOOTNOTE_ANCHOR', '<1>'), 
                Token('STR', '.\n\n'), 
                Tree(Token('RULE', 'footnote'), [
                    Token('FOOTNOTE_START', 'bc. fn1. '), 
                    Token('STR', 'This is the footnote contents\n\n')])]), 
            Tree(Token('RULE', 'paragraph'), [
                Token('PARAGRAPH_START', 'p. '), 
                Token('STR', 'And here is another paragraph')])])]), 
    Tree(Token('RULE', 'element'), [
        Tree(Token('RULE', 'field'), [
            Token('FIELD_START', '\n\n#['), 
            Tree(Token('RULE', 'field_name'), [
                Token('STR', 'second_field')]), 
            Token('FIELD_END', ']#\n')]), 
        Tree(Token('RULE', 'content'), [
            Token('STR', 'Some more contents\n')])])])

This correctly parses the different fields and footnotes; however, the code block is interrupted by a spuriously detected FOOTNOTE_ANCHOR. Because the lexer has no context, it tokenizes footnote anchors inside code as well. The same problem would occur when parsing bullet points.
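One idea would be to make the lexer stateful, so that it suppresses footnote anchors while inside a code block. An untested sketch of that direction (ContextLexer is a made-up name, and the grammar would then also need code_block to accept repeated STR tokens, e.g. STR+):

class ContextLexer(MyLexer):
    # Sketch only: demote FOOTNOTE_ANCHOR tokens to plain STR while
    # inside a code block, so code keeps its raw text.
    def lex(self, data):
        in_code = False
        for token in super().lex(data):
            if token.type == "CODE_BLOCK_START":
                in_code = True
            elif token.type in ("PARAGRAPH_START", "FIELD_START"):
                in_code = False  # a new paragraph or field ends the code block
            if in_code and token.type == "FOOTNOTE_ANCHOR":
                yield Token("STR", token.value)
            else:
                yield token

But I am not sure this kind of state tracking belongs in the lexer.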

What is the best solution to this problem? Do I really need a lexer? Is my lexer implemented correctly? (I can find very few examples of how to use a custom lexer for text.) Could I maybe lex only some tokens and leave the rest to a "parent" lexer?


Solution

  • Based on recognizing multi-line sections with lark grammar I was able to find a solution.

    The important part is not to use /.+/s to match multiple lines, because then the parser never gets the opportunity to match other tokens. It is better to match line by line, so that the parser gets a chance to apply a new rule for each line. I also switched the parser to "lalr"; it did not work with the default (earley) parser.

    parser.py

    from lark import Lark
    
    def read_file(filename):
        with open(filename) as f:
            return f.read()
    
    grammar = read_file('grammar.lark')
    parser = Lark(grammar, start='elements', parser="lalr")
    textile = read_file('test.textile')
    tree = parser.parse(textile)
    print(tree.pretty())
    

    grammar.lark

    elements: element+
    ?element: field content
    ?field: NEWLINE* FIELD_START field_name FIELD_END NEWLINE
    field_name: FIELD_NAME
    content: contents*
    ?contents: paragraph
        | code_block
    code_block: CODE_BLOCK_START (LINE NEWLINE)+
    paragraph: PARAGRAPH_START? (paragraph_line | bullets | footnote)+
    bullets: (BULLET paragraph_line)+
    footnote: FOOTNOTE_START LINE NEWLINE
    paragraph_line: (PARAGRAPH_LINE | FOOTNOTE_ANCHOR)+ NEWLINE
    
    FIELD_START: "#["
    FIELD_NAME: /[^\]]+/
    FIELD_END: "]#"
    FOOTNOTE_ANCHOR: /<\d>/
    FOOTNOTE_START: /bc\. fn\d\. /
    CODE_BLOCK_START: /bc\.\.? /
    PARAGRAPH_START: /p\. /
    LINE.-1: /.+/
    BULLET.-2: "*"
    PARAGRAPH_LINE.-3: /.+?(?=(<\d>|\r|\n))/
    
    %import common.NEWLINE
    

    The output:

    elements
      element
        field
          #[
          field_name        contents
          ]#
    
    
        content
          paragraph
            p.
            paragraph_line
              This is a paragraph.
    
    
    
            paragraph_line
              With some bullet points:
    
    
    
            bullets
              *
              paragraph_line
                 Bullet 1
    
    
              *
              paragraph_line
                 Bullet 2
    
    
              *
              paragraph_line
                 Bullet 3
    
    
    
            paragraph_line
              And a code block:
    
    
    
          code_block
            bc..
            # Program to display the Fibonacci sequence up to n-th term
    
    
    
            """\
    
    
            * Program to display the Fibonacci sequence up to n-th term
    
    
            """
    
    
    
            search_string = r"<5>"
    
    
    
            nterms = int(input("How many terms? "))
    
    
    
            # first two terms
    
    
            n1, n2 = 0, 1
    
    
            count = 0
    
    
    
          paragraph
            p.
            paragraph_line
              And after the block of code another paragraph, with a footnote
              <1>
              .
    
    
    
            footnote
              bc. fn1.
              This is the footnote contents
    
    
    
          paragraph
            p.
            paragraph_line
              And here is another paragraph
    
    
    
      element
        field
          #[
          field_name        second_field
          ]#
    
    
        content
          paragraph
            paragraph_line
              Some more contents
    

    Note that the parser now also correctly parses bullets and footnotes. In order to parse footnote anchors within a line, I have made a special PARAGRAPH_LINE terminal, which stops at the first footnote anchor it encounters or at the end of the line. Also note that LINE has a higher priority than BULLET, so inside a code block a whole line is matched as a normal LINE and bullets are not recognized there, only in a paragraph.
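    The effect of the lookahead in PARAGRAPH_LINE can be checked with plain re outside of Lark (the example lines are made up):

    import re

    # Lazy match that stops right before the first footnote anchor <n>,
    # or at the end of the line, mirroring the PARAGRAPH_LINE terminal.
    pattern = re.compile(r".+?(?=(<\d>|\r|\n))")

    print(pattern.match("with a footnote<1>.\n").group())
    # -> 'with a footnote'
    print(pattern.match("a plain line\n").group())
    # -> 'a plain line'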