
Simple way to remove tags (not content) matching a CSS selector?


Is there a simple approach to process an HTML file so that tags matching a certain CSS selector can be deleted? My motivation is that pandoc generates HTML output that in my view is too verbose, surrounding any math expression with <span class="math inline"> ... </span>, when generally ... is enough. For display math the input and output tend to have line breaks, so maybe a dedicated tool would be better than grep or similar. The goal is to reduce bandwidth usage, so anything client-side would be out.


Solution

  • Pandoc inserts those span tags so that JavaScript libraries like MathJax can find and display the math properly. You can of course remove them with your HTML processing tool of choice, e.g. Nokogiri if you're using Ruby. Put something like this in removespans.rb:

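    require 'nokogiri'

    # Parse the HTML that pandoc wrote.
    doc = Nokogiri::HTML(File.read("file.html"))
    # Replace each math span with its children: the tags go, the content stays.
    doc.css('span.math').each { |span| span.replace(span.children) }
    puts doc.to_html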
    

    Then execute:

    pandoc -s -o file.html input.md
    ruby removespans.rb > output.html
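
If installing Nokogiri is not an option, a rough alternative is a multi-line regular expression using only Ruby's standard library. This is a sketch, not a general HTML parser: it assumes the math spans contain no nested `<span>` elements and that the `class` attribute is written exactly as in the question (`class="math inline"` or `class="math display"`). The names `MATH_SPAN` and `unwrap_math_spans` are made up for this example.

```ruby
# Matches <span class="math ...">CONTENT</span> and captures CONTENT.
# The /m flag lets "." match newlines, so display math that spans
# several lines is handled too.
# Assumption: no <span> is nested inside a math span, so the first
# </span> after the opening tag is the matching one.
MATH_SPAN = %r{<span\s+class="math(?:\s[^"]*)?"\s*>(.*?)</span>}m

# Returns a copy of html with the math span tags dropped and their
# content kept. Spans with any other class are left untouched.
def unwrap_math_spans(html)
  html.gsub(MATH_SPAN) { Regexp.last_match(1) }
end
```

Saved as, say, unwrap_math.rb, it could be run as `ruby -r./unwrap_math.rb -e 'print unwrap_math_spans(ARGF.read)' file.html > output.html`. If the HTML might contain nested spans or differently written attributes, the Nokogiri approach is the safer choice.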