python, restructuredtext, lark

Using Lark to parse reST-like markup such as sections


I would like to define a basic grammar to get started with Lark. Here is my M(not)WE, a minimal example that does not yet work as intended.

from lark import Lark

GRAMMAR = r"""
?start: _NL* (day_heading)*

day_heading : "==" _NL day_nb _NL "==" _NL+ (paragraph _NL)*
day_nb      : /\d{2}/
paragraph   : /[^\n={2}]+/ (_NL+ paragraph)*
_NL         : /(\r?\n[\t ]*)+/
"""

parser = Lark(GRAMMAR)

tree = parser.parse("""


==
12
==

Bla, bla
Bli, Bli



Blu, Blu


==
10
==


Blo, blo


    """)

print(tree.pretty())

This prints:

start
  day_heading
    day_nb      12
    paragraph
      Bla, bla
      paragraph
        Bli, Bli
        paragraph       Blu, Blu
  day_heading
    day_nb      10
    paragraph   Blo, blo

The tree I want is the following one.

start
  day_heading
    day_nb      12
    paragraph
      line      Bla, bla
      line      Bli, Bli
      line      Blu, Blu
  day_heading
    day_nb      10
    paragraph
      line      Blo, blo

How can I modify my EBNF?


Solution

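  • Here is a possible answer. My original grammar misused a recursive rule: paragraph referred to itself, so each line ended up nested inside the previous one. Introducing a separate line rule and repeating it keeps all lines as siblings of one paragraph.

    Replacing _NL with NL would keep the newlines in the tree, since Lark filters out terminals whose names start with an underscore.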

    from lark import Lark
    
    GRAMMAR = r"""
    ?start: _NL* (day_heading)*
    
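    // A heading is a two-digit day number framed by two "==" lines.
    day_heading : "==" _NL day_nb _NL "==" _NL+ (paragraph)+
    day_nb      : /\d{2}/
    
    // Lines are siblings under one paragraph (no recursion).
    paragraph : (line _NL)+
    
    line : /[^\n={2}]+/
    // The leading underscore keeps newline tokens out of the tree.
    _NL  : /(\r?\n[\t ]*)+/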
    """
    
    parser = Lark(GRAMMAR)
    
    tree = parser.parse("""
    
    
    ==
    12
    ==
    
    Bla, bla
    Bli, Bli
    
    
    
    Blu, Blu
    
    
    ==
    10
    ==
    
    
    Blo, blo
    
    
        """)
    
    print(tree.pretty())
    

    This produces:

    start
      day_heading
        day_nb      12
        paragraph
          line      Bla, bla
          line      Bli, Bli
          line      Blu, Blu
      day_heading
        day_nb      10
        paragraph
          line      Blo, blo