Tags: python, regex, latex, pandoc, asciidoc

Converting adoc to Markdown while preserving LaTeX-style math equations


I have a group of adoc documents that I'm converting to markdown. For most of them I've been able to convert them with:

asciidoc -b docbook -o temp.xml <infile>
pandoc -f docbook -t markdown_strict --atx-headers --mathjax temp.xml -o <outfile>

followed by some regex to clean up broken image links and fix the headers. However, this doesn't work for the inline math equations. In the adoc they use the syntax latexmath:[$some_equation_here$], sometimes without the dollar signs for multi-line equations.

When this gets turned into DocBook XML, the equation seems to be preserved, in this format:

<inlineequation>
<alt><![CDATA[$some_equation_here$]]></alt>
<inlinemediaobject><textobject><phrase></phrase></textobject></inlinemediaobject>
</inlineequation>

but when pandoc converts it to Markdown, it ignores these blocks of XML. How can I keep the equation in a Markdown-readable format ($some_equation_here$) during the pandoc conversion? The --mathjax option doesn't seem to help here.

I tried a separate Python regex, re.sub(r'latexmath:\[\$?(.*?)\$?\]', r'$\g<1>$', file_contents), to keep the $, but it produces some double-escaped text that then has to be fixed manually, and it doesn't fully work: sometimes it leaves extra /sup tags. Trying something similar on the XML file gave similar results.


Solution

  • Looking at the pandoc code, it seems that the DocBook reader expects the formula to be in a <mathphrase> element below <inlineequation>. Thus, replacing the <alt> tags with <mathphrase> is enough for pandoc to pick up the equation. The result is generally not valid DocBook XML, as an <inlineequation> should contain either <mathphrase> or <inlinemediaobject> elements, not both, but that doesn't matter for pandoc.

    cat << EOF | pandoc --from=docbook --to markdown --lua-filter=unwrap-math.lua
    <para>
      <inlineequation>
        <mathphrase><![CDATA[$some_equation_here$]]></mathphrase>
        <inlinemediaobject><textobject><phrase></phrase></textobject></inlinemediaobject>
      </inlineequation>
    </para>
    EOF
    $some_equation_here$
    

    Note that pandoc adds the dollar signs itself, so the ones already inside the formula should be removed as well. The command above uses a Lua filter to do that; unwrap-math.lua contains:

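    -- Strip one leading and one trailing "$" from every math element;
    -- pandoc's Markdown writer adds its own delimiters.
    function Math (mth)
      mth.text = mth.text:gsub('^%$', ''):gsub('%$$', '')
      return mth
    end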
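
Since the question already post-processes the files with Python, the <alt> to <mathphrase> replacement described above could be scripted as a step between asciidoc and pandoc. The following is only a minimal sketch: the function name `alt_to_mathphrase` is made up for illustration, it assumes the XML looks like the snippet in the question (an <alt> element directly after the opening <inlineequation> tag), and it uses a regex rather than a real XML parser.

```python
import re

# An <alt> element that directly follows an opening <inlineequation> tag,
# as in the asciidoc output shown in the question (assumed layout).
ALT_IN_EQUATION = re.compile(
    r"(<inlineequation[^>]*>\s*)<alt>(.*?)</alt>",
    re.DOTALL,
)
# Optional CDATA wrapper around the formula.
CDATA = re.compile(r"^<!\[CDATA\[(.*)\]\]>$", re.DOTALL)


def alt_to_mathphrase(docbook_xml):
    """Rename <alt> to <mathphrase> inside <inlineequation> elements.

    One leading and one trailing "$" are dropped from the formula,
    because pandoc adds its own math delimiters when writing Markdown.
    """

    def fix(match):
        body = match.group(2).strip()
        cdata = CDATA.match(body)
        formula = (cdata.group(1) if cdata else body).strip()
        if formula.startswith("$"):
            formula = formula[1:]
        if formula.endswith("$"):
            formula = formula[:-1]
        # Keep the CDATA wrapper if the original had one.
        wrapped = "<![CDATA[" + formula + "]]>" if cdata else formula
        return match.group(1) + "<mathphrase>" + wrapped + "</mathphrase>"

    return ALT_IN_EQUATION.sub(fix, docbook_xml)
```

Applied to the contents of temp.xml before running pandoc, this strips the dollars up front, so the Lua filter should not be needed in that case. Like the filter, it removes only one dollar sign from each end of the formula.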