I am knitting an R Markdown document to PDF. One of my plot labels contains the expression $\times10^{23}$.
---
title: "Untitled"
output: pdf_document
date: "2023-06-24"
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## R Markdown
```{r}
plot(1, type="n", ylab=expression(paste("Count (\u00d7",10^23/L,")")))
```
However, in the knitted PDF the echoed code shows 10ˆ23 instead of 10^23: the caret is rendered as a different Unicode character. Copying the code from the PDF and pasting it into R therefore returns an error.
Thanks for any comments.
For reasons I don't completely understand (see below), what solved this issue for me was using `xelatex` (or `lualatex`) as the engine to compile the `.tex` file to PDF.
There is a setting in RStudio that supposedly sets this globally: in Tools > Global Options..., choose "Sweave" in the left-hand pane, and change the drop-down beside "Typeset LaTeX into PDF using:" to "XeLaTeX".
Changing that setting didn't actually change anything when compiling R Markdown files for me (the default is still pdflatex), but I was able to specify the LaTeX engine in the YAML header of a file by replacing `output: pdf_document` with this:
output:
  pdf_document:
    latex_engine: xelatex
A pdf produced from a file with this in the header should have the expected caret characters in the R code chunks (and other places, too).
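If you would rather not edit the header, the same engine can be requested from the R console. This is only a sketch: `analysis.Rmd` is a hypothetical file name, and it assumes the `rmarkdown` package and a working XeLaTeX installation.

```r
# Hypothetical file name; replace it with the path to your own document.
input <- "analysis.Rmd"

# pdf_document() accepts a latex_engine argument, e.g. "pdflatex",
# "xelatex" or "lualatex".
fmt <- rmarkdown::pdf_document(latex_engine = "xelatex")

# Render only if the file actually exists, so the snippet is safe to paste.
if (file.exists(input)) {
  rmarkdown::render(input, output_format = fmt)
}
```

An output format passed to `render()` this way is used in place of the `output:` entry in the YAML header for that one call.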
R Markdown outputs to PDF by first running the R code and collecting output (via the `knit()` function), then converting the resulting Markdown to LaTeX (`.tex`) using Pandoc, and then compiling the LaTeX file to PDF using a TeX engine (see the R Markdown documentation for details). Both Pandoc and the TeX engine (and maybe even the PDF program displaying the file) have a role to play in what character ends up on the screen to be copied.
The character we want to have in the output, so that we can paste it into the console, is ASCII code 94 (^), which is a "Caret - circumflex" character; but what we get from R Markdown's default settings is character code 136 (ˆ), which is a "Modifier letter circumflex accent" - in other words, a character accent without a letter underneath. (Strictly speaking, 136 is outside the ASCII range of 0 to 127; it is this character's position in the Windows-1252 "extended ASCII" table, and its Unicode code point is U+02C6.) I don't think this is R Markdown's fault, however.
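To check which of the two characters you actually copied, you can inspect code points in base R. The sketch below also shows a workaround for already-pasted text; the `pasted` string is a made-up example, not output from the document above.

```r
caret  <- "^"        # the character R's parser expects
accent <- "\u02c6"   # modifier letter circumflex accent (U+02C6)

utf8ToInt(caret)     # 94
utf8ToInt(accent)    # 710, i.e. 0x02C6

# Hypothetical text as it might come out of the PDF:
pasted <- "10\u02c623/L"

# Replace the accent with a real caret before evaluating the code.
fixed <- gsub(accent, caret, pasted, fixed = TRUE)
fixed                # "10^23/L"
```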
From what I can tell, Pandoc does a few things that are relevant to how the caret character is treated in the output. In particular, Pandoc:

- converts highlighted code blocks (e.g., `r` code chunks with `echo=TRUE`) into a custom `Verbatim` environment from the `fancyvrb` package in LaTeX.
- converts plain code blocks into a `verbatim` environment (LaTeX is case-sensitive, so this is not the same as a `Verbatim` environment).
- replaces `^` with `\^{}` in the LaTeX output, except for the contents of `verbatim` environments (i.e., plain code blocks, including console output) and Math Mode (`$...$`).
- replaces `\^` in Math Mode with `\textasciicircum`: this command is not allowed in Math Mode, and the LaTeX engine generates a warning, but continues. Ironically, in this situation, the output in the PDF is character code 136, unlike other contexts in a LaTeX document (see below). `\^{}` is not only not allowed in Math Mode, it would cause an error when compiled by the LaTeX engine. This might be why Pandoc uses `\textasciicircum` in this situation, though there are better alternatives in Math Mode (see links below).

There are many ways to represent this character in LaTeX, and it depends on the context (plain text, verbatim, or Math Mode). See "The Comprehensive LaTeX Symbol List" and this StackOverflow answer for some of the options and details.
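One way to see what Pandoc actually wrote for your own document is to keep the intermediate `.tex` file and search it for the two commands. This is a rough sketch: `find_caret_commands()` is a helper I made up for illustration, `analysis.Rmd` is a hypothetical file name, and I am assuming the kept `.tex` file lands next to the input with the same base name.

```r
# Hypothetical helper: return the lines that contain either caret command.
find_caret_commands <- function(tex_lines) {
  # fixed = TRUE matches the backslash and braces literally
  hits <- grepl("\\^{}", tex_lines, fixed = TRUE) |
    grepl("\\textasciicircum", tex_lines, fixed = TRUE)
  tex_lines[hits]
}

input <- "analysis.Rmd"  # hypothetical file name

if (file.exists(input)) {
  # keep_tex = TRUE asks rmarkdown to keep the LaTeX file Pandoc produced
  rmarkdown::render(input, output_format = rmarkdown::pdf_document(keep_tex = TRUE))
  tex_file <- sub("\\.Rmd$", ".tex", input)
  print(find_caret_commands(readLines(tex_file, encoding = "UTF-8")))
}
```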
`\^{}` and `\textasciicircum` are often presented as 'equivalent' representations in LaTeX, at least in text mode. But in my experience, it is not always so. Usually (but not always), I find that with the pdflatex engine, `\^{}` produces an accent character (character code 136, not what we want), whereas `\textasciicircum` produces a caret (ASCII code 94, what we want).
This makes sense to me, given that the `\^{}` command is also used by LaTeX to add a circumflex accent to a letter, by putting that letter in the braces as an argument (e.g., `\^{o}` produces "ô"). So without an argument, the semantic meaning of this command is an accent without a letter.
I also know that XeTeX and LuaTeX handle input encoding and fonts differently from pdfTeX. The fact that changing the TeX engine results in different characters in the output suggests the issue might have to do with fonts, but it could also be how the engines process the commands themselves. But that is where my knowledge and understanding run out.
Why does Pandoc replace "^" with "`\^{}`" instead of "`\textasciicircum`"? I don't know, but I've asked the question on the pandoc-discuss mailing list, and I await a reply as of this writing. EDIT: is this a bug in Pandoc?
Why do XeTeX and LuaTeX render `\^{}` differently than pdfTeX? In my limited experience, `\^{}` and `\textasciicircum` can be different when using pdflatex (though not with the `standalone` document class, and I don't understand why), but they both produce the same output character when using xelatex or lualatex.
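To compare the engines yourself, you can compile the same minimal LaTeX file with each one and copy the characters out of the resulting PDFs. The sketch below assumes the `tinytex` package and both engines are installed; the file names are arbitrary, and I make no claim about what each PDF will contain.

```r
# Minimal LaTeX document that uses both commands in text mode.
tex <- c(
  "\\documentclass{article}",
  "\\begin{document}",
  "accent command: 10\\^{}23",
  "",
  "textasciicircum: 10\\textasciicircum{}23",
  "\\end{document}"
)

engines <- c("pdflatex", "xelatex")
out_dir <- tempfile("caret-test-")
dir.create(out_dir)

# One copy of the file per engine, so the PDFs do not overwrite each other.
tex_files <- file.path(out_dir, paste0("caret-", engines, ".tex"))
for (f in tex_files) writeLines(tex, f)

# Compile only when tinytex and both engines can be found.
if (requireNamespace("tinytex", quietly = TRUE) && all(nzchar(Sys.which(engines)))) {
  for (i in seq_along(engines)) {
    tinytex::latexmk(tex_files[i], engine = engines[i])
  }
}

list.files(out_dir)  # the PDFs should appear here if compilation ran
```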
Strangely, the entries for character codes 94 (table 583) and 136 (table 585) in "The Comprehensive LaTeX Symbol List" appear to be reversed: the character appearing beside code 94 is actually 136, according to asciivalue.com, and vice versa. But the characters shown are what is produced by the commands shown: i.e., `\^{}` produces character code 136 and `\textasciicircum` produces code 94, despite what the document claims in the tables. Is this an error in the document, a bug in pdfLaTeX, or something else? Since these are functionally and semantically different characters, why does this document (and others) claim that "`\^{}`" and "`\textasciicircum`" are equivalent?