ruby nokogiri tidy

Cleaning HTML with Nokogiri (instead of Tidy)


The tidy gem is no longer maintained and has multiple memory leak issues.

Some people suggested using Nokogiri.

I'm currently cleaning the HTML using:

Nokogiri::HTML::DocumentFragment.parse(html).to_html

I've got two issues though:

  • Nokogiri removes the DOCTYPE

  • Is there an easy way to force the cleaned HTML to have html and body tags?


Solution

  • If you are processing a full document, you want:

    Nokogiri::HTML(html).to_html
    

    That will force html and body tags, and introduce or preserve the DOCTYPE:

    puts Nokogiri::HTML('<p>Hi!</p>').to_html
    #=> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
    #=>  "http://www.w3.org/TR/REC-html40/loose.dtd">
    #=> <html><body><p>Hi!</p></body></html>
    
    puts Nokogiri::HTML('<!DOCTYPE html><p>Hi!</p>').to_html
    #=> <!DOCTYPE html>
    #=> <html><body><p>Hi!</p></body></html>
    

    Note that the output is not guaranteed to be valid against the DOCTYPE it declares. For example, if I provide a broken document that falsely claims to be HTML 4.01 Strict, Nokogiri will output a document with that DOCTYPE but without the required <head><title>...</title></head> section:

    dtd = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">'
    puts Nokogiri::HTML("#{dtd}<p>Hi!</p>").to_html
    #=> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
    #=>  "http://www.w3.org/TR/html4/strict.dtd">
    #=> <html><body><p>Hi!</p></body></html>