html jupyter-notebook ms-word pandoc

Converting HTML pages with equations to docx


I am trying to convert an HTML document to docx using pandoc.

pandoc -s Template.html --mathjax -o Test.docx 

The conversion to docx goes smoothly except for the equations. In the HTML file, an equation looks like this:

<div class="jp-Cell jp-MarkdownCell jp-Notebook-cell">
<div class="jp-Cell-inputWrapper">
<div class="jp-Collapser jp-InputCollapser jp-Cell-inputCollapser">
</div>
<div class="jp-InputArea jp-Cell-inputArea"><div class="jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput " data-mime-type="text/markdown">
\begin{equation}
\log_{10}(\mu)={-2.64}+\frac{4437.038}{T-544.391}
\end{equation}
</div>
</div>
</div>
</div>

After running the pandoc command the result in the docx document is:

\begin{equation} \log_{10}(\mu)={-2.64}+\frac{4437.038}{T-544.391} \end{equation}

Do you have any idea how I can overcome this issue?


Solution

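  • A Lua filter can help here. The code below looks for div elements with a data-mime-type="text/markdown" attribute and, somewhat paradoxically, parses their content as LaTeX. The original div is then replaced with the parse result. Pandoc's HTML reader drops the data- prefix from attribute names, which is why the filter checks for mime-type.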

    local stringify = pandoc.utils.stringify
    function Div (div)
      if div.attributes['mime-type'] == 'text/markdown' then
        return pandoc.read(stringify(div), 'latex').blocks
      end
    end
    

    Save the code to a file parse-math.lua and let pandoc use it with the --lua-filter / -L option:

    pandoc --lua-filter parse-math.lua ...
    

    As noted in a comment, this gets slightly more complicated if there are other HTML elements with the text/markdown media type. In that case we'll check if the parse result contains only math, and keep the original content otherwise.

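    local stringify = pandoc.utils.stringify
    function Div (div)
      if div.attributes['mime-type'] == 'text/markdown' then
        local result = pandoc.read(stringify(div), 'latex').blocks
        -- inlines of the first parsed block, or an empty list if there is none
        local first = result[1] and result[1].content or {}
        -- replace the div only if that block is a single Math element;
        -- returning nil leaves the original div untouched
        return (#first == 1 and first[1].t == 'Math')
          and result
          or nil
      end
    end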
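
A quick way to confirm that the filter picks up the equations, before opening the result in Word, is to run it on a small test file and print pandoc's native AST. The sketch below is only an illustration: sample.html is a made-up file name, its content is cut down from the snippet in the question, and the filter is the first version shown above. If everything works, the output should contain a Math element instead of the raw \begin{equation} text.

```shell
# Write the first version of the filter to parse-math.lua.
cat > parse-math.lua <<'EOF'
local stringify = pandoc.utils.stringify
function Div (div)
  if div.attributes['mime-type'] == 'text/markdown' then
    return pandoc.read(stringify(div), 'latex').blocks
  end
end
EOF

# Hypothetical test input: a single div reduced from the question's HTML.
cat > sample.html <<'EOF'
<div data-mime-type="text/markdown">
\begin{equation}
\log_{10}(\mu)={-2.64}+\frac{4437.038}{T-544.391}
\end{equation}
</div>
EOF

# Print the document structure after filtering (skipped if pandoc is not installed).
if command -v pandoc >/dev/null 2>&1; then
  pandoc sample.html --lua-filter parse-math.lua -t native
fi
```

For the file in the question, the call would presumably become pandoc -s Template.html --lua-filter parse-math.lua -o Test.docx. The --mathjax option only affects HTML output, so it can be dropped when writing docx.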