
HTML-formatted hyperlinks not preserved in bookdown PDF


I have several HTML-formatted links in my bookdown .Rmd files that disappear in the generated PDF. The link target appears to be ignored, and the PDF displays only the link text.

For example, <a href="https://www.cygwin.com" target="_blank">Cygwin</a> simply appears as Cygwin (no hyperlink).

But when the link text is the URL itself, it works fine (e.g., <a href="https://www.cygwin.com" target="_blank">https://www.cygwin.com</a>), presumably because the displayed text is recognized as a link on its own.

Is there a way to have bookdown preserve these HTML hyperlinks in the PDF output?

I am running the following to generate the PDF in RStudio:

    render_book("index.Rmd", "bookdown::pdf_book")

And the top of index.Rmd looks like this:

    title: "My Title"
    site: bookdown::bookdown_site
    documentclass: book
    link-citations: yes
    output:
      bookdown::pdf_book:
        pandoc_args: [--wrap=none]
    urlcolor: blue

Solution

  • Pandoc, and by extension R Markdown, passes the raw HTML of the links through unchanged. Raw HTML chunks are written to output formats that support HTML (like EPUB), but not to LaTeX (which is used to generate the PDF). Pandoc still parses the content between the tags, which is why it seems to work when your link text is a URL.

    The simplest solution would of course be to use Markdown syntax for links instead, which is just as expressive as HTML: [Cygwin](https://www.cygwin.com){target="_blank"}. However, if that is not an option, then things get a bit hacky.

    Here's a method to still parse those links. It uses a Lua filter to convert the raw HTML into a proper link. Save the following script as parse-html-links.lua in the same directory as your Rmd file and add '--lua-filter=parse-html-links.lua' to your list of pandoc_args.

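    -- inlines collected between an opening <a ...> and its closing </a>
    local elements_in_link = {}
    -- raw HTML of the opening tag; nil while outside a link
    local link_start
    local link_end

    Inline = function (el)
      if el.t == 'RawInline' and el.format:match'html.*' then
        -- opening tag: remember it and drop the raw element
        if el.text:match'<a ' then
          link_start = el.text
          return {}
        end
        -- closing tag: only act if an opening tag was seen before
        if link_start and el.text:match'</a' then
          link_end = el.text
          -- let pandoc parse the tag pair to get a Link with the right target
          local link = pandoc.read(link_start .. link_end, 'html').blocks[1].content[1]
          link.content = elements_in_link
          -- reset
          elements_in_link, link_start, link_end = {}, nil, nil
          return link
        end
      end
      -- collect link content
      if link_start then
        table.insert(elements_in_link, el)
        return {}
      end
      -- keep original element
      return nil
    end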
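
    As a sketch of the resulting configuration, assuming the filter is saved as parse-html-links.lua next to index.Rmd and the rest of the front matter from the question stays unchanged, the output section would look roughly like this:

    ```yaml
    output:
      bookdown::pdf_book:
        # existing option from the question, plus the Lua filter
        pandoc_args: ["--wrap=none", "--lua-filter=parse-html-links.lua"]
    ```

    If the filter lives somewhere else, the path in --lua-filter would need to be adjusted accordingly.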