Tags: markdown, docx, pandoc

Prevent pandoc from turning an image’s alt text into a paragraph when converting docx to Markdown


I have a Markdown document named input.md that contains a remote image link:

Using Pandoc to convert documents

![this is the image caption](https://i.sstatic.net/sk5Pr.png)

Pandoc is really *awesome*!

I convert it to docx with the command below:

pandoc input.md -o output.docx

After getting output.docx, I convert it back to Markdown:

pandoc output.docx --extract-media=. -t commonmark-raw_html -o output.md

Here the output format commonmark-raw_html (CommonMark with the raw_html extension disabled) is used to suppress the HTML tag that carries the image size. The converted content of output.md is:

Using Pandoc to convert documents

![this is the image caption](./media/rId20.png)

this is the image caption

Pandoc is really *awesome*!

As you can see, the image’s alt text "this is the image caption" appears twice: first as the actual alt text of the image, and then again as a separate paragraph below it. I would like to get rid of that second paragraph, which is redundant.

Why do I convert the Markdown twice? Because I want to replace the remote image link in the Markdown with a link to a local copy of the image.

Any suggestions for resolving this would be appreciated. Thanks in advance!


Solution

  • The easiest way is probably to skip the docx round trip and convert directly from Markdown to Markdown with the --extract-media option:

    pandoc input.md --extract-media=media -t commonmark -o output.md
    

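    That option can be used with any input format; it is just less commonly used that way. This is one of the use cases where it makes sense, because pandoc downloads the remote image and rewrites the link to point at the local copy.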
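
If the docx round trip is needed for some other reason, the redundant paragraph could also be removed after the fact. The following is only a minimal sketch, not a pandoc feature: `strip_duplicate_captions` is a hypothetical helper name, and it assumes the image sits alone in its own paragraph and that the duplicated caption is the very next paragraph with text identical to the alt text (as in the output.md shown in the question). Captions containing formatting may not match exactly and would be left in place.

```shell
#!/bin/sh
# Hypothetical helper: drop a paragraph that merely repeats the alt text
# of the image paragraph directly before it. Reads files given as
# arguments, or standard input when none are given.
strip_duplicate_captions() {
  awk '
    # Paragraph mode: records are blocks of text separated by blank lines.
    BEGIN { RS = ""; ORS = "\n\n" }
    {
      # Skip this paragraph if it equals the alt text remembered
      # from the previous paragraph.
      if (alt != "" && $0 == alt) { alt = ""; next }
      alt = ""
      # Remember the alt text when the paragraph is a lone image: ![alt](target)
      if ($0 ~ /^!\[[^]]*\]\([^)]*\)$/) {
        end = index($0, "](")
        alt = substr($0, 3, end - 3)
      }
      print
    }
  ' "$@"
}
```

Assuming the conversion command from the question, the helper could be placed at the end of a pipeline, since pandoc writes to standard output when no -o option is given:

```shell
pandoc output.docx --extract-media=. -t commonmark-raw_html | strip_duplicate_captions > output.md
```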