How does pandoc parse latex code in a .md file?


I am using RStudio with knitr/rmarkdown/pandoc/LaTeX to render .Rmd code to PDF. Some LaTeX code is rendered exactly as expected, while very slightly different code is not parsed correctly, so my .tex file ends up containing lines like "\textbackslash{}begin{table}" instead of "\begin{table}".

Googling reveals similar mis-parsing when dealing with HTML, but I'm going straight from .Rmd to .md to .tex to .pdf.

This is all dependent on the particular version/platform of Rstudio I'm using, as well as the R packages knitr, xtable, rmarkdown, an rmarkdown template, etc., so I've been struggling to come up with an MWE.

(I did check that my version of pandoc is >= 1.13, because Googling suggested there's a bug in earlier versions that may be related.)

However, I've now got a kinda MWE that I can at least isolate to how pandoc is parsing its temporary .utf8.md file to create the .tex file.

The following markdown is parsed correctly from .md to .tex to .pdf:

# Data Profile
\begin{table}[htbp]
\centering
\parbox{12cm}{\caption{\small Record Count of Things Summarized in this Table.\label{MyRef}\vspace{4pt}}} 
{\small
\begin{tabular}{llrrr}
Thing & Characteristic & Aspect 1 & Aspect 2 & Aspect 3 \\ 
\hline
Some & data & rows & go & here \\ 
more & data & rows & go & here \\ 
\end{tabular}
}
\end{table}

But another bit of markdown, which is identical in every way to what's above except for lacking the \parbox around the \caption (which is how the R xtable package implements its own caption.width option), gets completely mangled. The relevant alternate line:

\caption{\small Record Count of Things Summarized in this Table.\label{MyRef}\vspace{4pt}}

These two markdown chunks are parsed according to the command below by Rstudio into the respective .tex chunks. I've satisfied myself that this is happening during pandoc processing because I can see that the .utf8.md files with and without the \parbox are otherwise identical, but the resulting .tex files differ, and everything else (the rmarkdown template, the pandoc options, etc.) stays exactly the same.

/usr/local/rstudio-0.98.1103/bin/pandoc/pandoc +RTS -K512m -RTS MyDoc.utf8.md --to latex --from markdown+autolink_bare_uris+ascii_identifiers+tex_math_single_backslash --output MyDoc.tex --filter /usr/local/rstudio-0.98.1103/bin/pandoc/pandoc-citeproc --template /home/user/R/x86_64-unknown-linux-gnu-library/3.2/MyRmarkdownTemplate/rmarkdown/templates/report/resources/template.tex --highlight-style tango --latex-engine pdflatex --bibliography bibliography.bib

Good:

\begin{table}[htbp]
\centering
\parbox{12cm}{\caption{\small Record Count of Things Summarized in This table.\label{MyRef}\vspace{4pt}}} 
{\small
\begin{tabular}{llrrr}
Thing & Characteristic & Aspect 1 & Aspect 2 & Aspect 3 \\ 
\hline
Some & data & rows & go & here \\ 
more & data & rows & go & here \\ 
\end{tabular}
}
\end{table}

Bad:

\textbackslash{}begin\{table\}{[}htbp{]} \centering
\textbackslash{}caption\{\small Record Count of Things Summarized in This Table.\label{MyRef}\vspace{4pt}\} \{\small

\begin{tabular}{llrrr}
Thing & Characteristic & Aspect 1 & Aspect 2 & Aspect 3 \\ 
\hline
Some & data & rows & go & here \\ 
more & data & rows & go & here \\ 
\end{tabular}
\} \textbackslash{}end\{table\}

In other words, for some reason, without that \parbox, pandoc doesn't realize that it's parsing latex until it reaches the \small inside the opening brace just before the \begin{tabular}. With the parbox, it knows it's latex right at the first backslash in the \begin{table}.

So my question is: What's up with this? And how do I fix it?


Solution

  • It turns out the culprit was the \vspace inside the caption, or at least removing it leads to correct parsing. It must be non-standard enough in that position that pandoc's LaTeX reader fails on it.

    See Yihui's comment on the original question. His link (https://github.com/jgm/pandoc/issues/2493) indicates that pandoc's LaTeX parser silently falls back to interpreting problematic LaTeX as plain text, which I think explains what's happening here.
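
    For reference, here is a sketch of the table from the question with the \vspace dropped from the caption, which is the only change described above. The commented-out \belowcaptionskip line is an assumption on my part and untested with this pandoc version: it is a standard LaTeX length, and setting it in the template preamble is one possible way to get the space under the caption back without putting \vspace where pandoc has to parse it.

```latex
% Optional, in the preamble of template.tex (assumption: the template is editable):
% \setlength{\belowcaptionskip}{4pt}

\begin{table}[htbp]
\centering
% Caption without the trailing \vspace{4pt}
\caption{\small Record Count of Things Summarized in this Table.\label{MyRef}}
{\small
\begin{tabular}{llrrr}
Thing & Characteristic & Aspect 1 & Aspect 2 & Aspect 3 \\
\hline
Some & data & rows & go & here \\
more & data & rows & go & here \\
\end{tabular}
}
\end{table}
```

    To confirm the result, search the generated .tex file for "\textbackslash{}begin"; if it no longer appears, pandoc passed the table through as raw LaTeX.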