r  regex  latex  r-markdown  pandoc

regex to match outermost tags (string pair)


I have a book that was written in Sweave and contains a lot of LaTeX that I am trying to convert to R Markdown. I have managed to write a script that converts most of the LaTeX to reasonable markdown, but nested lists elude me.

My idea so far is to detect where a list starts and ends and then pass it on to pandoc for conversion, since I think writing a parser would make this unnecessarily difficult. The problem is detecting where the list starts and ends when the list is nested.

I found an example of matching bracketed tags (the question "Regex match outer nested tags"), but I haven't been able to figure out how to adapt it to match \begin and \end.

Example data:

meh meh

\begin{itemize}
\item something1
\begin{itemize}
\item something1.1
\item something1.2
\end{itemize}
\item something2
\begin{itemize}
\item something2.1

\item something2.2
\end{itemize}
\end{itemize}

blah blah

\begin{itemize}
\item somethingelse1
\item somethingelse2
\end{itemize}

the end.

There should be two matches above: one for the nested list and one for the list below it. Can this be done with a regex, or do you see some smarter way?


Solution

  • The regex that matches recursively between \begin{...} and \end{...} is a PCRE regex like

    (?s)\\begin\{[^{}]*}(?:(?!\\(?:end|begin)).|(?R))*\\end\{[^{}]*}
    

    A more efficient, unrolled version of the regex is shown below. It also checks for { after \begin and \end in the lookahead, and it needs no (?s) flag because it uses negated character classes instead of . to consume text:

    \\begin\{[^{}]*}(?:[^\\]*(?:\\(?!(?:end|begin)\{)[^\\]*)*|(?R))*\\end\{[^{}]*}
    

    See the regex demo #1 and regex demo #2. Details of the first pattern:

    • (?s) - a singleline/dotall/s modifier that makes . match across line breaks
    • \\begin\{ - \begin{ string
    • [^{}]* - zero or more chars other than { and }
    • } - a } char
    • (?:(?!\\(?:end|begin)).|(?R))* - zero or more occurrences, as many as possible, of
      • (?!\\(?:end|begin)). - any one char that does not start a \end or \begin char sequence
      • | - or
      • (?R) - the whole regex pattern is recursed
    • \\end\{ - \end{ string
    • [^{}]*} - zero or more chars other than { and } and then a } char.
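
To see why the (?R) branch matters, here is a minimal sketch comparing the recursive pattern with a plain lazy match. The `lazy` pattern and the tiny `nested` input are illustrative assumptions, not part of the original answer.

```r
# Tiny nested input: one outer list containing one inner list.
nested <- "\\begin{itemize}\n\\item a\n\\begin{itemize}\n\\item b\n\\end{itemize}\n\\end{itemize}"

# Non-recursive attempt (illustrative): lazily match up to the first \end{...}.
lazy <- "(?s)\\\\begin\\{[^{}]*}.*?\\\\end\\{[^{}]*}"
# Recursive pattern from the answer.
rec <- "(?s)\\\\begin\\{[^{}]*}(?:(?!\\\\(?:end|begin)).|(?R))*\\\\end\\{[^{}]*}"

lazy_hit <- regmatches(nested, regexpr(lazy, nested, perl = TRUE))
rec_hit  <- regmatches(nested, regexpr(rec, nested, perl = TRUE))

# The lazy match stops at the inner \end{itemize}, so the outer list is cut short.
lazy_hit == nested  # FALSE
# The recursive match consumes the inner list as a unit and reaches the outer \end.
rec_hit == nested   # TRUE
```

The lazy version has no way to count how many \begin tokens are still open, which is what the recursion supplies.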

    Sample R code:

    # Sample text with one nested list and one flat list
    x <- "meh meh\n\\begin{itemize}\n\\item something1\n\\begin{itemize}\n\\item something1.1\n\\item something1.2\n\\end{itemize}\n\\item something2\n\\begin{itemize}\n\\item something2.1\n\\item something2.2\n\\end{itemize}\n\\end{itemize}\nblah blah\n\\begin{itemize}\n\\item somethingelse1\n\\item somethingelse2\n\\end{itemize}\nthe end.\n"
    # Recursive pattern; every regex backslash is doubled inside the R string literal
    reg <- "(?s)\\\\begin\\{[^{}]*}(?:(?!\\\\(?:end|begin)).|(?R))*\\\\end\\{[^{}]*}"
    # Unrolled alternative:
    ## reg2 <- "\\\\begin\\{[^{}]*}(?:[^\\\\]*(?:\\\\(?!(?:end|begin)\\{)[^\\\\]*)*|(?R))*\\\\end\\{[^{}]*}"
    # perl=TRUE is required: recursion and lookaheads are PCRE features
    result <- regmatches(x, gregexpr(reg, x, perl=TRUE))
    

    Output:

    > result
    [[1]]
    [1] "\\begin{itemize}\n\\item something1\n\\begin{itemize}\n\\item something1.1\n\\item something1.2\n\\end{itemize}\n\\item something2\n\\begin{itemize}\n\\item something2.1\n\\item something2.2\n\\end{itemize}\n\\end{itemize}"
    [2] "\\begin{itemize}\n\\item somethingelse1\n\\item somethingelse2\n\\end{itemize}"
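
Since the goal in the question is to hand each outer list to pandoc, the following is a rough sketch of how the matches could be replaced in place with `regmatches<-`. The helper names `latex_to_md` and `replace_outer` are hypothetical, and `latex_to_md` assumes a `pandoc` executable is available on the PATH; the demo call uses a stand-in converter so it runs without pandoc.

```r
# Same recursive pattern as in the answer.
reg <- "(?s)\\\\begin\\{[^{}]*}(?:(?!\\\\(?:end|begin)).|(?R))*\\\\end\\{[^{}]*}"

# Hypothetical helper: convert one LaTeX fragment to markdown with pandoc.
# Assumption: `pandoc` is installed and on the PATH.
latex_to_md <- function(tex) {
  out <- system2("pandoc", c("-f", "latex", "-t", "markdown"),
                 input = tex, stdout = TRUE)
  paste(out, collapse = "\n")
}

# Hypothetical helper: replace every outermost environment in `text`
# (a single string) with the result of convert(match).
replace_outer <- function(text, convert = latex_to_md) {
  m <- gregexpr(reg, text, perl = TRUE)
  hits <- regmatches(text, m)[[1]]
  if (length(hits) > 0) {
    converted <- vapply(hits, convert, character(1), USE.NAMES = FALSE)
    regmatches(text, m) <- list(converted)
  }
  text
}

# Stand-in converter so the sketch runs without pandoc installed.
demo <- replace_outer("a\n\\begin{itemize}\n\\item x\n\\end{itemize}\nb",
                      convert = function(tex) "<LIST>")
demo
```

Passing the converter as an argument keeps the matching logic testable on its own; in real use the default `latex_to_md` would be called once per outer list, with nested lists included in the fragment it receives.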