Regex to match outermost tags (string pair)

I have a book that was written in Sweave and contains a lot of LaTeX that I am trying to convert to R Markdown. I have managed to write a script that converts most of the LaTeX to reasonable Markdown, but nested lists elude me.

My idea so far is to detect where a list starts and ends and then pass it to pandoc for conversion, since I think writing a full parser would make this unnecessarily difficult. The problem is detecting where a list starts and ends when lists are nested.

I found an example of matching bracketed tags (Regex match outer nested tags), but I haven't been able to figure out how to adapt it to match \begin and \end.

Example data:

meh meh

\begin{itemize}
\item something1
\begin{itemize}
\item something1.1
\item something1.2
\end{itemize}
\item something2
\begin{itemize}
\item something2.1

\item something2.2
\end{itemize}
\end{itemize}

blah blah

\begin{itemize}
\item somethingelse1
\item somethingelse2
\end{itemize}

the end.

There should be two matches above: one for the nested list and one for the list below it. Can this be done with a regex, or do you see some smarter way?

Solution

A PCRE regex that uses recursion can match everything from an outermost \begin{...} to its matching \end{...}:

(?s)\\begin\{[^{}]*}(?:(?!\\(?:end|begin)).|(?R))*\\end\{[^{}]*}

A more efficient, unrolled version of the regex (which also checks for { after \begin and \end in the lookahead) is:

\\begin\{[^{}]*}(?:[^\\]*(?:\\(?!(?:end|begin)\{)[^\\]*)*|(?R))*\\end\{[^{}]*}

See regex demo #1 and regex demo #2. Details of the first pattern:

(?s) - a singleline/dotall/s modifier that makes . match across line breaks
\\begin\{ - \begin{ string
[^{}]* - zero or more chars other than { and }
} - a } char
(?:(?!\\(?:end|begin)).|(?R))* - zero or more occurrences, as many as possible, of
- (?!\\(?:end|begin)). - any one char that does not start a \end or \begin char sequence
- | - or
- (?R) - the whole regex pattern is recursed
\\end\{ - \end{ string
[^{}]*} - zero or more chars other than { and } and then a } char.
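
To see why the recursion matters, here is a minimal sketch comparing a lazy, non-recursive pattern against the recursive one. The short input string and the name `naive` are made up for illustration; they are not from the question.

```r
# Small nested list, made up for this illustration
x <- "\\begin{itemize}\n\\item a\n\\begin{itemize}\n\\item b\n\\end{itemize}\n\\item c\n\\end{itemize}"

# Lazy, non-recursive pattern: stops at the FIRST \end{...} it can reach
naive <- "(?s)\\\\begin\\{[^{}]*}.*?\\\\end\\{[^{}]*}"

# Recursive pattern from above: each inner \begin...\end pair is consumed by (?R)
recursive <- "(?s)\\\\begin\\{[^{}]*}(?:(?!\\\\(?:end|begin)).|(?R))*\\\\end\\{[^{}]*}"

naive_match     <- regmatches(x, regexpr(naive, x, perl = TRUE))
recursive_match <- regmatches(x, regexpr(recursive, x, perl = TRUE))

cat("naive:\n", naive_match, "\n\nrecursive:\n", recursive_match, "\n", sep = "")
```

The lazy pattern ends at the inner \end{itemize} and leaves the outer list unbalanced, while the recursive pattern returns the whole outer list.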

Sample R code:

x <- "meh meh\n\\begin{itemize}\n\\item something1\n\\begin{itemize}\n\\item something1.1\n\\item something1.2\n\\end{itemize}\n\\item something2\n\\begin{itemize}\n\\item something2.1\n\\item something2.2\n\\end{itemize}\n\\end{itemize}\nblah blah\n\\begin{itemize}\n\\item somethingelse1\n\\item somethingelse2\n\\end{itemize}\nthe end.\n"
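# Recursive pattern; each regex backslash is doubled again inside the R string literal
reg <- "(?s)\\\\begin\\{[^{}]*}(?:(?!\\\\(?:end|begin)).|(?R))*\\\\end\\{[^{}]*}"
# Unrolled alternative:
## reg2 <- "\\\\begin\\{[^{}]*}(?:[^\\\\]*(?:\\\\(?!(?:end|begin)\\{)[^\\\\]*)*|(?R))*\\\\end\\{[^{}]*}"
# perl=TRUE selects the PCRE engine, which the (?R) recursion and the lookahead require
result <- regmatches(x, gregexpr(reg, x, perl=TRUE))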

Output:

> result
[[1]]
[1] "\\begin{itemize}\n\\item something1\n\\begin{itemize}\n\\item something1.1\n\\item something1.2\n\\end{itemize}\n\\item something2\n\\begin{itemize}\n\\item something2.1\n\\item something2.2\n\\end{itemize}\n\\end{itemize}"
[2] "\\begin{itemize}\n\\item somethingelse1\n\\item somethingelse2\n\\end{itemize}"
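
If the goal is to replace each outermost list with its converted form, one possible approach is to keep the match data from gregexpr and splice the replacements back in with `regmatches<-`. The sketch below is only an illustration: `convert_list` is a hypothetical placeholder (a real version might call pandoc instead), and the input is a shortened, made-up string.

```r
# Hypothetical stand-in for the real conversion step (e.g. a call to pandoc);
# it only returns a marker so that the sketch runs without external tools.
convert_list <- function(latex) {
  paste0("<<LIST: ", nchar(latex), " chars>>")
}

x <- "meh\n\\begin{itemize}\n\\item a\n\\begin{itemize}\n\\item a.1\n\\end{itemize}\n\\end{itemize}\nblah\n\\begin{itemize}\n\\item b\n\\end{itemize}\nthe end.\n"
reg <- "(?s)\\\\begin\\{[^{}]*}(?:(?!\\\\(?:end|begin)).|(?R))*\\\\end\\{[^{}]*}"

m <- gregexpr(reg, x, perl = TRUE)       # match positions and lengths
lists <- regmatches(x, m)[[1]]           # the outermost lists only
converted <- vapply(lists, convert_list, character(1), USE.NAMES = FALSE)

out <- x
regmatches(out, m) <- list(converted)    # put the replacements where the lists were
cat(out)
```

Text outside the lists is left untouched, so the rest of the conversion script can keep working on `out` as before.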