Tags: python, parsing, lexer, textile, lark-parser

Parsing elements/fields without end marker, problems with non-greedy regex, usage of a custom lexer


I want to be able to parse files in the Textile markup language (https://textile-lang.com/) in order to convert them to LaTeX. The file I have is a slight extension of Textile, since it adds fields and footnotes. An example file is given below.

test.textile

#[contents]#
p. This is a paragraph.

With some bullet points:

* Bullet 1
* Bullet 2
* Bullet 3

And a code block:

bc.. # Program to display the Fibonacci sequence up to n-th term

"""\
* Program to display the Fibonacci sequence up to n-th term
"""

search_string = r"<5>"

nterms = int(input("How many terms? "))

# first two terms
n1, n2 = 0, 1
count = 0

p. And after the block of code another paragraph, with a footnote<1>.

bc. fn1. This is the footnote contents

p. And here is another paragraph

#[second_field]#
Some more contents

To parse the file I have the following parser.

parser.py

from lark import Lark

def read_file(filename):
    with open(filename) as f:
        return f.read()

grammar = read_file('grammar.lark')
parser = Lark(grammar, start='elements')
textile = read_file('test.textile')
tree = parser.parse(textile)
print(tree)

And the following grammar. The grammar does not handle bullet points and footnotes yet, because I already run into another problem without them.

grammar.lark

elements: element+
?element: field content
?field: FIELD_START field_name FIELD_END
field_name: FIELD_NAME
content: contents*
?contents: paragraph
    | code_block
code_block: CODE_BLOCK_START STR
paragraph: PARAGRAPH_START? STR

FIELD_START: /(\A|[\r\n]{2,})#\[/
FIELD_NAME: /[^\]]+/
FIELD_END: /\]#[\r\n]/
CODE_BLOCK_START: /bc\.\.? /
PARAGRAPH_START: /p\. /
STR: /.+/s

When I run the parser, I get the following output.

output

Tree(Token('RULE', 'elements'), [
    Tree(Token('RULE', 'element'), [
        Tree(Token('RULE', 'field'), [
            Token('FIELD_START', '#['), 
            Tree(Token('RULE', 'field_name'), [
                Token('FIELD_NAME', 'contents')]), 
            Token('FIELD_END', ']#\n')]), 
        Tree(Token('RULE', 'content'), [
            Tree(Token('RULE', 'paragraph'), [
                Token('PARAGRAPH_START', 'p. '), 
                Token('STR', 'This is a paragraph.\n\nWith some bullet points:\n\n* Bullet 1\n* Bullet 2\n* Bullet 3\n\nAnd a code block:\n\nbc.. # Program to display the Fibonacci sequence up to n-th term\n\n"""\\\n* Program to display the Fibonacci sequence up to n-th term\n"""\n\nsearch_string = r"<5>"\n\nnterms = int(input("How many terms? "))\n\n# first two terms\nn1, n2 = 0, 1\ncount = 0\n\np. And after the block of code another paragraph, with a footnote<1>.\n\nbc. fn1. This is the footnote contents\n\np. And here is another paragraph\n\n#[second_field]#\nSome more contents\n')])])])])

The whole rest of the file is parsed as a single paragraph, which is sort of correct, since /.+/s can match anything. So I change the definition of STR to /.+?/s to make it non-greedy, but now the output is as follows (pretty-printed):

output

elements
  element
    field
      #[
      field_name        contents
      ]#

    content
      paragraph
        p.
        T
      paragraph h
      paragraph i
      paragraph s
      paragraph
      paragraph i
      paragraph s
      paragraph
      paragraph a
      paragraph
--snip--
      paragraph

      paragraph #
      paragraph [
      paragraph s
      paragraph e
      paragraph c
      paragraph o
      paragraph n
      paragraph d
      paragraph _
      paragraph f
      paragraph i
      paragraph e
      paragraph l
      paragraph d
      paragraph ]
      paragraph #

Now every single character is parsed as its own paragraph, and the whole rest of the file is still consumed as paragraph content.
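To see why both variants misbehave, it helps to try the two regexes on their own, outside of Lark. A minimal sketch using Python's re module, with a made-up input:

import re

text = "p. First paragraph.\n\np. Second paragraph.\n"

# Greedy: with DOTALL, .+ consumes everything up to the end of the
# input, so the parser never gets a chance to match a second
# PARAGRAPH_START.
print(re.match(r".+", text, re.DOTALL).group())   # the whole input

# Non-greedy: .+? matches as little as possible, i.e. a single
# character, so the parser ends up building one paragraph per character.
print(re.match(r".+?", text, re.DOTALL).group())  # just "p"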

My first attempt at a solution was to create a custom lexer that emits tokens for FIELD_START, FIELD_END, PARAGRAPH_START, CODE_BLOCK_START and the footnote-related token types.

My lexer looks as follows:

from lark import Lark
from lark.lexer import Lexer, Token
import re

class MyLexer(Lexer):
    def __init__(self, *args, **kwargs):
        # Lark passes lexer configuration here; this lexer needs none of it.
        pass

    def lex(self, data):
        tokens = {
            "FIELD_START": r"(?:\A|[\r\n]{2,})#\[",
            "FIELD_END": r"\]#[\r\n]",
            "FOOTNOTE_ANCHOR": r"<\d>",
            "FOOTNOTE_START": r"bc\. fn\d\. ",
            "PARAGRAPH_START": r"p\. ",
            "CODE_BLOCK_START": r"bc\.\.? ",
        }
        # Each pattern gets its own capture group, so re.split returns the
        # matched delimiters along with the text between them.
        regex = '|'.join(f"({r})" for r in tokens.values())
        for x in re.split(regex, data):
            if not x:
                continue  # skip empty strings and None (non-matching groups)
            for token_type, token_regex in tokens.items():
                if re.match(token_regex, x):
                    yield Token(token_type, x)
                    break
            else:
                yield Token("STR", x)

parser = Lark(grammar, lexer=MyLexer, start='elements')

It builds a combined regex from the given token patterns, splits the whole input on it, and yields everything as a token: either one of the defined types, or "STR".
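As a quick sanity check, the lexer can be run on a short snippet in isolation (the input string here is made up for illustration):

for token in MyLexer().lex("p. Some text with a footnote<1>.\n"):
    print(token.type, repr(token.value))

# PARAGRAPH_START 'p. '
# STR 'Some text with a footnote'
# FOOTNOTE_ANCHOR '<1>'
# STR '.\n'

The new grammar looks as follows: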

elements: element+
?element: field content
?field: FIELD_START field_name FIELD_END
field_name: STR
content: contents*
?contents: STR 
    | paragraph 
    | code_block
code_block: CODE_BLOCK_START STR
paragraph: PARAGRAPH_START? paragraph_contents+
?paragraph_contents: STR 
    | FOOTNOTE_ANCHOR
    | footnote
footnote: FOOTNOTE_START STR

%declare FIELD_START FIELD_END FOOTNOTE_ANCHOR FOOTNOTE_START PARAGRAPH_START CODE_BLOCK_START STR

The output of the parser is as follows:

Tree(Token('RULE', 'elements'), [
    Tree(Token('RULE', 'element'), [
        Tree(Token('RULE', 'field'), [
            Token('FIELD_START', '#['), 
            Tree(Token('RULE', 'field_name'), [
                Token('STR', 'contents')]), 
            Token('FIELD_END', ']#\n')]), 
        Tree(Token('RULE', 'content'), [
            Tree(Token('RULE', 'paragraph'), [
                Token('PARAGRAPH_START', 'p. '), 
                Token('STR', 'This is a paragraph.\n\nWith some bullet points:\n\n* Bullet 1\n* Bullet 2\n* Bullet 3\n\nAnd a code block:\n\n')]), 
            Tree(Token('RULE', 'code_block'), [
                Token('CODE_BLOCK_START', 'bc.. '), 
                Token('STR', '# Program to display the Fibonacci sequence up to n-th term\n\n"""\\\n* Program to display the Fibonacci sequence up to n-th term\n"""\n\nsearch_string = r"')]), 
            Tree(Token('RULE', 'paragraph'), [
                Token('FOOTNOTE_ANCHOR', '<5>'), 
                Token('STR', '"\n\nnterms = int(input("How many terms? "))\n\n# first two terms\nn1, n2 = 0, 1\ncount = 0\n\n')]), 
            Tree(Token('RULE', 'paragraph'), [
                Token('PARAGRAPH_START', 'p. '), 
                Token('STR', 'And after the block of code another paragraph, with a footnote'), 
                Token('FOOTNOTE_ANCHOR', '<1>'), 
                Token('STR', '.\n\n'), 
                Tree(Token('RULE', 'footnote'), [
                    Token('FOOTNOTE_START', 'bc. fn1. '), 
                    Token('STR', 'This is the footnote contents\n\n')])]), 
            Tree(Token('RULE', 'paragraph'), [
                Token('PARAGRAPH_START', 'p. '), 
                Token('STR', 'And here is another paragraph')])])]), 
    Tree(Token('RULE', 'element'), [
        Tree(Token('RULE', 'field'), [
            Token('FIELD_START', '\n\n#['), 
            Tree(Token('RULE', 'field_name'), [
                Token('STR', 'second_field')]), 
            Token('FIELD_END', ']#\n')]), 
        Tree(Token('RULE', 'content'), [
            Token('STR', 'Some more contents\n')])])])

This correctly parses the different fields and footnotes; however, the code block is interrupted by a spuriously detected FOOTNOTE_ANCHOR. Because the lexer has no context, it tokenizes footnote anchors inside code as well. The same problem would occur when parsing bullet points.
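One idea would be to make the lexer stateful, so that it suppresses footnote anchors while inside a code block. An untested sketch of that direction (ContextLexer is a made-up name, and the grammar would then also need code_block to accept repeated STR tokens, e.g. STR+):

class ContextLexer(MyLexer):
    # Sketch only: demote FOOTNOTE_ANCHOR tokens to plain STR while
    # inside a code block, so code keeps its raw text.
    def lex(self, data):
        in_code = False
        for token in super().lex(data):
            if token.type == "CODE_BLOCK_START":
                in_code = True
            elif token.type in ("PARAGRAPH_START", "FIELD_START"):
                in_code = False  # a new paragraph or field ends the code block
            if in_code and token.type == "FOOTNOTE_ANCHOR":
                yield Token("STR", token.value)
            else:
                yield token

But I am not sure this kind of state tracking belongs in the lexer.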

What is the best solution to this problem? Do I really need a lexer? Is my lexer implemented correctly? (I can find very few examples of how to use a custom lexer for text.) Could I maybe lex only some tokens and leave the rest to a "parent" lexer?


Solution

  • Based on recognizing multi-line sections with lark grammar I was able to find a solution.

    The important part is not to use /.+/s to match multiple lines, because then the parser never gets the opportunity to match other tokens. It is better to match line by line, so that the parser gets a chance to apply a new rule for each line. I also switched the parser to "lalr"; it did not work with the default (earley) parser.

    parser.py

    from lark import Lark
    
    def read_file(filename):
        with open(filename) as f:
            return f.read()
    
    grammar = read_file('grammar.lark')
    parser = Lark(grammar, start='elements', parser="lalr")
    textile = read_file('test.textile')
    tree = parser.parse(textile)
    print(tree.pretty())
    

    grammar.lark

    elements: element+
    ?element: field content
    ?field: NEWLINE* FIELD_START field_name FIELD_END NEWLINE
    field_name: FIELD_NAME
    content: contents*
    ?contents: paragraph
        | code_block
    code_block: CODE_BLOCK_START (LINE NEWLINE)+
    paragraph: PARAGRAPH_START? (paragraph_line | bullets | footnote)+
    bullets: (BULLET paragraph_line)+
    footnote: FOOTNOTE_START LINE NEWLINE
    paragraph_line: (PARAGRAPH_LINE | FOOTNOTE_ANCHOR)+ NEWLINE
    
    FIELD_START: "#["
    FIELD_NAME: /[^\]]+/
    FIELD_END: "]#"
    FOOTNOTE_ANCHOR: /<\d>/
    FOOTNOTE_START: /bc\. fn\d\. /
    CODE_BLOCK_START: /bc\.\.? /
    PARAGRAPH_START: /p\. /
    LINE.-1: /.+/
    BULLET.-2: "*"
    PARAGRAPH_LINE.-3: /.+?(?=(<\d>|\r|\n))/
    
    %import common.NEWLINE
    

    The output:

    elements
      element
        field
          #[
          field_name        contents
          ]#
    
    
        content
          paragraph
            p.
            paragraph_line
              This is a paragraph.
    
    
    
            paragraph_line
              With some bullet points:
    
    
    
            bullets
              *
              paragraph_line
                 Bullet 1
    
    
              *
              paragraph_line
                 Bullet 2
    
    
              *
              paragraph_line
                 Bullet 3
    
    
    
            paragraph_line
              And a code block:
    
    
    
          code_block
            bc..
            # Program to display the Fibonacci sequence up to n-th term
    
    
    
            """\
    
    
            * Program to display the Fibonacci sequence up to n-th term
    
    
            """
    
    
    
            search_string = r"<5>"
    
    
    
            nterms = int(input("How many terms? "))
    
    
    
            # first two terms
    
    
            n1, n2 = 0, 1
    
    
            count = 0
    
    
    
          paragraph
            p.
            paragraph_line
              And after the block of code another paragraph, with a footnote
              <1>
              .
    
    
    
            footnote
              bc. fn1.
              This is the footnote contents
    
    
    
          paragraph
            p.
            paragraph_line
              And here is another paragraph
    
    
    
      element
        field
          #[
          field_name        second_field
          ]#
    
    
        content
          paragraph
            paragraph_line
              Some more contents
    

    Note that the parser now also correctly parses bullets and footnotes. In order to parse footnote anchors within a line, I have made a special PARAGRAPH_LINE terminal, which stops at the first footnote anchor it encounters or at the end of the line. Also note that LINE has a higher priority than BULLET, so inside a code block a whole line is matched as a normal LINE and bullets are not recognized there, only in a paragraph.
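    The effect of the lookahead in PARAGRAPH_LINE can be checked with plain re outside of Lark (the example lines are made up):

    import re

    # Lazy match that stops right before the first footnote anchor <n>,
    # or at the end of the line, mirroring the PARAGRAPH_LINE terminal.
    pattern = re.compile(r".+?(?=(<\d>|\r|\n))")

    print(pattern.match("with a footnote<1>.\n").group())
    # -> 'with a footnote'
    print(pattern.match("a plain line\n").group())
    # -> 'a plain line'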