I have a book that was written in Sweave and contains a lot of LaTeX that I am trying to convert to R Markdown. I have managed to write a script that converts most of the LaTeX to reasonable Markdown, but nested lists elude me.
My idea so far is to detect where a list starts and ends and then pass it on to pandoc for conversion, since I think writing a parser would make this unnecessarily difficult. The problem is detecting where the list starts and ends when the list is nested.
I found an example of matching bracketed tags (Regex match outer nested tags), but I haven't been able to figure out how to adapt it to match \begin and \end.
Example data:
meh meh
\begin{itemize}
\item something1
\begin{itemize}
\item something1.1
\item something1.2
\end{itemize}
\item something2
\begin{itemize}
\item something2.1
\item something2.2
\end{itemize}
\end{itemize}
blah blah
\begin{itemize}
\item somethingelse1
\item somethingelse2
\end{itemize}
the end.
There should be two matches above: one for the nested list and one for the list below it. Can this be done with a regex, or do you see some smarter way?
The regex that matches recursively between \begin{...} and \end{...} is a PCRE regex like
(?s)\\begin\{[^{}]*}(?:(?!\\(?:end|begin)).|(?R))*\\end\{[^{}]*}
A more efficient version of the regex (an unrolled one; I also added a check for { after \begin and \end in the lookahead) is:
\\begin\{[^{}]*}(?:[^\\]*(?:\\(?!(?:end|begin)\{)[^\\]*)*|(?R))*\\end\{[^{}]*}
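To illustrate why the extra { check in the lookahead can matter, here is a small sketch. It assumes an input (made up for this example, not taken from the question) in which a command such as \endgroup appears inside a list. With the first pattern, \endgroup looks like the start of an \end tag, so the lookahead refuses to step over it and the list should not be matched at all; the unrolled pattern only stops at \end{ or \begin{ and should still match the whole list.

```r
# Both patterns as R string literals (every regex backslash is doubled for R).
reg  <- "(?s)\\\\begin\\{[^{}]*}(?:(?!\\\\(?:end|begin)).|(?R))*\\\\end\\{[^{}]*}"
reg2 <- "\\\\begin\\{[^{}]*}(?:[^\\\\]*(?:\\\\(?!(?:end|begin)\\{)[^\\\\]*)*|(?R))*\\\\end\\{[^{}]*}"

# Made-up input: \endgroup starts with \end but is not an \end{...} tag.
y <- "\\begin{itemize}\n\\item a \\endgroup b\n\\end{itemize}"

# perl = TRUE is needed because recursion and lookaheads are PCRE features.
m1 <- regmatches(y, gregexpr(reg,  y, perl = TRUE))[[1]]
m2 <- regmatches(y, gregexpr(reg2, y, perl = TRUE))[[1]]

length(m1)  # number of matches with the simple lookahead
length(m2)  # number of matches with the stricter lookahead
```

If your text never contains commands that merely begin with \end or \begin, both patterns should behave the same and the choice is mostly about speed.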
See the regex demo #1 and regex demo #2. Details:
(?s) - a singleline/dotall/s modifier that makes . match across line breaks
\\begin\{ - a \begin{ string
[^{}]* - zero or more chars other than { and }
} - a } char
(?:(?!\\(?:end|begin)).|(?R))* - zero or more occurrences, as many as possible, of
    (?!\\(?:end|begin)). - any one char that does not start a \end or \begin char sequence
    | - or
    (?R) - the whole regex pattern is recursed
\\end\{ - a \end{ string
[^{}]*} - zero or more chars other than { and } and then a } char.

Sample R code:
x <- "meh meh\n\\begin{itemize}\n\\item something1\n\\begin{itemize}\n\\item something1.1\n\\item something1.2\n\\end{itemize}\n\\item something2\n\\begin{itemize}\n\\item something2.1\n\\item something2.2\n\\end{itemize}\n\\end{itemize}\nblah blah\n\\begin{itemize}\n\\item somethingelse1\n\\item somethingelse2\n\\end{itemize}\nthe end.\n"
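# Recursive pattern; each regex backslash is doubled again inside the R string
reg <- "(?s)\\\\begin\\{[^{}]*}(?:(?!\\\\(?:end|begin)).|(?R))*\\\\end\\{[^{}]*}"
## Unrolled alternative:
## reg2 <- "\\\\begin\\{[^{}]*}(?:[^\\\\]*(?:\\\\(?!(?:end|begin)\\{)[^\\\\]*)*|(?R))*\\\\end\\{[^{}]*}"
# perl=TRUE is required: (?R) and lookaheads are PCRE-only features
result <- regmatches(x, gregexpr(reg, x, perl=TRUE))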
Output:
> result
[[1]]
[1] "\\begin{itemize}\n\\item something1\n\\begin{itemize}\n\\item something1.1\n\\item something1.2\n\\end{itemize}\n\\item something2\n\\begin{itemize}\n\\item something2.1\n\\item something2.2\n\\end{itemize}\n\\end{itemize}"
[2] "\\begin{itemize}\n\\item somethingelse1\n\\item somethingelse2\n\\end{itemize}"
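
Since the goal in the question is to convert each outer list and put the result back into the text, here is a minimal sketch of that round trip. It assumes a shortened sample input, and convert_list is a hypothetical placeholder for whatever conversion step you use (for example a call to pandoc); it is not part of any package. The replacement form of regmatches writes the converted strings back in place of the matches.

```r
# Shortened sample input (an assumption for this sketch).
x <- paste0(
  "meh meh\n",
  "\\begin{itemize}\n\\item a\n",
  "\\begin{itemize}\n\\item a.1\n\\end{itemize}\n",
  "\\end{itemize}\n",
  "blah blah\n",
  "\\begin{itemize}\n\\item b\n\\end{itemize}\n",
  "the end.\n"
)
reg <- "(?s)\\\\begin\\{[^{}]*}(?:(?!\\\\(?:end|begin)).|(?R))*\\\\end\\{[^{}]*}"

# Hypothetical placeholder for the real conversion step. A pandoc call such as
#   system2("pandoc", c("-f", "latex", "-t", "markdown"), input = latex, stdout = TRUE)
# could go here, assuming pandoc is installed and on the PATH.
convert_list <- function(latex) {
  sprintf("[converted list, %d chars]", nchar(latex))
}

m <- gregexpr(reg, x, perl = TRUE)   # match positions of the outer lists
lists <- regmatches(x, m)[[1]]       # the matched LaTeX blocks
converted <- vapply(lists, convert_list, character(1), USE.NAMES = FALSE)

out <- x
regmatches(out, m) <- list(converted)  # put the converted text back in place
cat(out)
```

Only the outermost lists are replaced, because the nested ones are consumed as part of their parent match.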