
Lua filter to preserve HTML comments


I'm trying to write a Lua filter that preserves HTML comments (but no other HTML elements).

local function starts_with(start, str)
  return str:sub(1, #start) == start
end

function RawInline(el)
  if starts_with('<!--', el.text) then
    return el
  else
    return nil
  end
end

return {{Inline = RawInline}}

(Based on mb21's answer here: From HTML to Markdown: As clean Markdown markup as possible, and to preserve HTML comments.)

It doesn't currently work when invoked with the command below. What might be the problem?

pandoc -f html+raw_html from.html -o to.md -t gfm --lua-filter preserve-comments.lua

Solution

  • Two small problems prevent this filter from working. Both are listed below, each with an explanation and a fix.

    1. The main issue is return {{Inline = RawInline}}. This causes the RawInline function to be called for every Inline element, such as Str, Emph, Space, etc. Some of those elements don't have a .text attribute, so starts_with receives nil as its second argument, and calling str:sub on nil triggers an error.

      The solution for this is to either use return {{RawInline = RawInline}}, or to leave the line out entirely. Both solutions are equivalent due to the way pandoc constructs filters from global functions if no explicit filter table is returned.

    2. The RawInline function does nothing, because return el and return nil do the same thing in this case. Returning nil (or not returning anything) from a filter function causes pandoc to keep the element unaltered. To delete an element, return an empty list, {}.
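The error described in point 1 can be reproduced without pandoc. The following is a minimal sketch in plain Lua: the two tables are hypothetical stand-ins for pandoc elements (real elements are created by pandoc itself), and pcall is used only to capture the error instead of aborting the script.

```lua
-- Same helper as in the question.
local function starts_with(start, str)
  return str:sub(1, #start) == start
end

-- Hypothetical stand-ins for pandoc elements, for illustration only.
local raw_comment = {t = 'RawInline', format = 'html', text = '<!-- note -->'}
local emph = {t = 'Emph', content = {}}  -- note: no .text field

-- A RawInline has a .text field, so the check runs normally.
local ok_raw, is_comment = pcall(starts_with, '<!--', raw_comment.text)

-- An Emph has no .text field, so str is nil and str:sub(...) raises an error.
local ok_emph, err = pcall(starts_with, '<!--', emph.text)

print(ok_raw, is_comment)
print(ok_emph, err)
```

The second pcall reports a failure, which is the situation the filter runs into when it is registered under the Inline key and gets called for elements other than RawInline.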

    To summarize, this should work:

    local function starts_with(start, str)
      return str:sub(1, #start) == start
    end
    
    function RawInline(el)
      if not starts_with('<!--', el.text) then
        return {}
      end
    end
    

    To ensure that no HTML at all is included in the output, we can use gfm-raw_html as the output format, i.e., we disable the raw_html extension. This would also suppress HTML comments, so we modify the filter to pretend that these comments are raw Markdown, which is included verbatim.

    local function starts_with(start, str)
      return str:sub(1, #start) == start
    end
    
    function RawInline (el)
      return starts_with('<!--', el.text)
        and pandoc.RawInline('markdown', el.text) -- pretend it's md
        or {}  -- not an HTML comment, thus drop it
    end
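One possible extension, offered as a sketch rather than as part of the original answer: the filter above only handles RawInline. Assuming that comments standing between block-level elements reach the filter as RawBlock elements, they would need the same treatment. The version below also checks el.format so that only raw HTML is considered. The is_html_comment helper and the fallback pandoc table are hypothetical additions; the fallback exists only so that the logic can be tried with a plain lua interpreter, and is never used when the file runs as a pandoc filter.

```lua
-- Outside pandoc, fall back to a tiny hypothetical stand-in for the
-- two constructors used below. Inside pandoc the real module is used.
local pandoc = pandoc or {
  RawInline = function(format, text)
    return {t = 'RawInline', format = format, text = text}
  end,
  RawBlock = function(format, text)
    return {t = 'RawBlock', format = format, text = text}
  end,
}

-- Hypothetical helper: true if el is raw HTML that starts like a comment.
local function is_html_comment(el)
  return el.format == 'html' and el.text:sub(1, 4) == '<!--'
end

function RawInline(el)
  if is_html_comment(el) then
    return pandoc.RawInline('markdown', el.text)  -- pretend it's Markdown
  end
  return {}  -- any other raw inline: drop it
end

function RawBlock(el)
  if is_html_comment(el) then
    return pandoc.RawBlock('markdown', el.text)  -- pretend it's Markdown
  end
  return {}  -- any other raw block: drop it
end
```

Because both functions are globals and no filter table is returned, pandoc picks them up by name, as described in point 1 above.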