
How to not progress in Rust pest parser


I am trying to build a basic LaTeX parser using the pest library. For the moment, I only care about lines, bold formatting and plain text. I am struggling with the latter. To simplify the problem, I assume that plain text cannot contain these two characters: \ and }.

lines = { line ~ (NEWLINE ~ line)* }
line = { token* }

token = { text_bold | text_plain }

text_bold = { "\\textbf{" ~ text_plain ~ "}" }
text_plain = ${ inner ~ ("\\" | "}" | NEWLINE) }
inner = @{ char* }
char = {
    !("\\" | "}" | NEWLINE) ~ ANY
}

main = {
  SOI ~
  lines ~
  EOI
}

Using this webapp, we can see that my grammar eats the char after the plain text.

Input:
Before \textbf{middle} after.
New line

Output:
- lines > line
  - token > text_plain > inner: "Before "
  - token > text_plain > inner: "textbf{middle"
  - token > text_plain > inner: " after."
  - token > text_plain > inner: "New line"

If I replace ${ inner ~ ("\\" | "}" | NEWLINE) } by ${ inner }, it fails. If I add & in front of the suffix, it does not work either.

How can I change my grammar so that lines and bold tags are detected?


Solution

  • The rule

    text_plain = ${ inner ~ ("\\" | "}" | NEWLINE) }
    

    certainly matches the character following inner (which must be a backslash, a closing brace, or a newline). That's not what you want: you want the following character to be part of the next token. Still, it seems reasonable to ask what happened to that character, since the token corresponding to text_plain clearly doesn't show it.

    The answer, apparently, is a subtlety in how tokens are formed. According to the Pest book:

    When the rule starts being parsed, the starting part of the token is being produced, with the ending part being produced when the rule finishes parsing.

    The key here, it turns out, is what is not being said. ("\\" | "}" | NEWLINE) is not a rule, and therefore it does not trigger any token pairs. So when you iterate over the tokens inside text_plain, you only see the token generated by inner.
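
    To see this for yourself, one purely diagnostic experiment (not the fix) is to give the delimiter a name. The rule name delim below is hypothetical and does not appear in the original grammar. Because it is a named rule, it should show up as its own pair next to inner when you iterate over text_plain:

    ```pest
    // Diagnostic sketch only: `delim` is a hypothetical rule name.
    // The delimiter is still consumed here, so this does not solve the problem.
    text_plain = ${ inner ~ delim }
    delim = { "\\" | "}" | NEWLINE }
    inner = @{ char* }
    char = { !("\\" | "}" | NEWLINE) ~ ANY }
    ```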

    None of that is really relevant, since text_plain should not attempt to match the following character in any event. I suppose you realised that, because you say you tried changing the rule to text_plain = ${ inner }, but that it "fails". It would have been useful to know what the failure was, but I suppose Pest complained about applying a repetition operator to a rule that can match the empty string.

    Since inner is a *-repetition, it can match the empty string; defining text_plain as a copy of inner means that text_plain can also match the empty string; that means that token ({ text_bold | text_plain }) can match the empty string, and that makes token* illegal because Pest doesn't allow applying repetition operators to a nullable rule. The simplest solution is to change inner from char* to char+, which forces it to match at least one character.
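
    If you would rather keep inner, that minimal change might look like the sketch below. It assumes the rest of the grammar from the question stays as it is; only text_plain and inner differ:

    ```pest
    // text_plain no longer consumes the delimiter after the text.
    text_plain = ${ inner }
    // `+` instead of `*`: inner must match at least one character,
    // so token can no longer match the empty string.
    inner = @{ char+ }
    char = { !("\\" | "}" | NEWLINE) ~ ANY }
    ```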

    In the following, I actually got rid of inner altogether, since it seems redundant:

    main = { SOI ~ lines ~ EOI }
    
    lines = { line ~ (NEWLINE ~ line)* ~ NEWLINE? }
    line = { token* }
    
    token = { text_bold | text_plain }
    
    text_bold = { "\\textbf{" ~ text_plain ~ "}" }
    text_plain = @{ char+ }
    char = {
        !("\\" | "}" | NEWLINE) ~ ANY
    }