Tags: ruby-on-rails, ruby, html-parsing, nokogiri

Nokogiri raises error when parsing HTML with <meta charset="UTF-8">


Nokogiri::HTML reports an error when I add <meta charset="UTF-8"> to an .html file.

The file is:

<!DOCTYPE html> 
<html>
  <head>
    <meta charset="UTF-8">
    <title>My super content</title>
    <link rel="stylesheet" type="text/css" href="./static.css">
  </head>

  <body>
    <footer>
      <p></p>
    </footer>
    <script type="text/javascript" src="./static.js"></script>
  </body>

</html>

When I parse it I get:

$ require 'nokogiri'
$ doc = Nokogiri::HTML(File.open('myfile.html'))
$ doc.errors
> [#<Nokogiri::XML::SyntaxError: 10:12: ERROR: Tag footer invalid>]

Removing <meta charset="UTF-8"> fixes the problem.

Why? And how can I make it work with the tag in place?


Solution

  • Nokogiri is primarily an XML parser and thus expects mostly valid XML. Although HTML looks a lot like XML, especially HTML5, it has different rules for things such as closing tags, and different algorithms for detecting things such as the encoding. These differences make HTML5 incompatible with XML and XML parsers.

    In an issue related to your problem, Mike Dalessio (one of the Nokogiri maintainers) responded:

    Nokogiri does not support HTML5. You may want to check out the Nokogumbo project, which aims for HTML5 compatibility with the Gumbo parser.