markdown pandoc

From HTML to Markdown: as clean Markdown markup as possible, while preserving HTML comments


Here is my HTML file, which I want to convert to Markdown. Note the first line is a comment, which I want to preserve.

<!-- https://fs.blog/feynman-technique/ -->
<h1 class="entry-title entry-title-single">The Feynman Technique: Master the Art of Learning</h1>
<div class="entry-content entry-content-single">
<p>The Feynman Technique is the most effective method to unlock your potential and develop a deep understanding. </p>
<p><a href="https://fs.blog/intellectual-giants/richard-feynman/">Richard Feynman</a> was not only a Nobel laureate in Physics but also a master of demystifying complex topics. His key learning insight: complexity and jargon often mask a lack of understanding. </p>
<p>Feynman&#8217;s learning technique comprises four key steps:</p>
<ol>
<li>Select a concept to learn.</li>
<li>Teach it to a child.</li>
<li>Review and refine your understanding.</li>
<li>Organize your notes and revisit them regularly.</li>
</ol>
<p>...</p>
<div class="wp-block-image">
<figure class="aligncenter"><img fetchpriority="high" decoding="async" width="1920" height="1080" src="https://149664534.v2.pressablecdn.com/wp-content/uploads/2012/04/FeynmanTechnique.jpg" alt="" class="wp-image-43131" srcset="https://149664534.v2.pressablecdn.com/wp-content/uploads/2012/04/FeynmanTechnique-300x169.jpg 300w , https://149664534.v2.pressablecdn.com/wp-content/uploads/2012/04/FeynmanTechnique-768x432.jpg 768w , https://149664534.v2.pressablecdn.com/wp-content/uploads/2012/04/FeynmanTechnique-1024x576.jpg 1024w , https://149664534.v2.pressablecdn.com/wp-content/uploads/2012/04/FeynmanTechnique-1536x864.jpg 1536w , https://149664534.v2.pressablecdn.com/wp-content/uploads/2012/04/FeynmanTechnique.jpg 1920w " sizes="(max-width: 1920px) 100vw, 1920px" /></figure></div>
<figure class="wp-block-pullquote"><blockquote><p>The person who says he knows what he thinks but cannot express it usually does not know what he thinks.</p><cite>Mortimer Adler</cite></blockquote></figure>
<h2 class="wp-block-heading">Step 1: Select a concept to learn.</h2>
<p>...</p>

My current solution is

pandoc from.htm -o to.md -t gfm-raw_html --wrap=none

and it gives me really neat markup, without any garbage,

# The Feynman Technique: Master the Art of Learning

The Feynman Technique is the most effective method to unlock your potential and develop a deep understanding.

[Richard Feynman](https://fs.blog/intellectual-giants/richard-feynman/) was not only a Nobel laureate in Physics but also a master of demystifying complex topics. His key learning insight: complexity and jargon often mask a lack of understanding.

Feynman’s learning technique comprises four key steps:

1.  Select a concept to learn.
2.  Teach it to a child.
3.  Review and refine your understanding.
4.  Organize your notes and revisit them regularly.

...

![](https://149664534.v2.pressablecdn.com/wp-content/uploads/2012/04/FeynmanTechnique.jpg)

> The person who says he knows what he thinks but cannot express it usually does not know what he thinks.
>
> Mortimer Adler

## Step 1: Select a concept to learn.

...

but the problem is that it doesn't preserve HTML comments. Is there a way to fix this issue?


Solution

  • Your problem is that you're disabling raw_html with -t gfm-raw_html. The following preserves raw HTML (including comments, which are represented just as raw HTML in the pandoc document AST):

    pandoc -f html+raw_html -t gfm
    

    Depending on what you want to achieve, it's possible that you'll need to write a pandoc Lua filter to remove the raw HTML snippets that are not comments. Something like the following (untested):

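    -- Returning an empty list deletes the element;
    -- returning nil would leave it unchanged.
    function RawInline(el)
      return {}
    end

    -- Keep raw HTML blocks only if they are comments.
    function RawBlock(el)
      if el.format == 'html' and el.text:match('^%s*<!%-%-') then
        return el
      else
        return {}
      end
    end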
    

    Either way, try -t native to inspect the document AST that is passed from the reader to the writer.
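
The steps above could be tried out roughly as follows. This is only a sketch under a few assumptions: sample.htm is a tiny made-up stand-in for from.htm, and keep-comments.lua is a hypothetical file name for the Lua filter saved to disk. The --lua-filter and --wrap options are standard pandoc options; the exact output may vary between pandoc versions.

```shell
# Hypothetical sample input; in practice this would be from.htm.
printf '%s\n' \
  '<!-- https://fs.blog/feynman-technique/ -->' \
  '<h1 class="entry-title">The Feynman Technique</h1>' \
  '<p>Some text.</p>' > sample.htm

# 1. Print the AST produced by the HTML reader, to check how the
#    comment is represented before any writer touches it.
pandoc -f html+raw_html -t native sample.htm

# 2. Convert to GFM with raw HTML kept by the reader and the writer.
pandoc -f html+raw_html -t gfm --wrap=none sample.htm -o sample.md

# 3. If the filter has been saved as keep-comments.lua (assumed name),
#    run the conversion again with the filter applied.
if [ -f keep-comments.lua ]; then
  pandoc -f html+raw_html -t gfm --wrap=none \
    --lua-filter=keep-comments.lua sample.htm -o sample.md
fi
```

Comparing the native output of step 1 with the resulting sample.md should make it easier to see which raw HTML snippets survive and whether the filter needs adjusting.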