Nokogiri::HTML reports an error when I add <meta charset="UTF-8">
to a .html file.
The file looks like this:
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>My super content</title>
<link rel="stylesheet" type="text/css" href="./static.css">
</head>
<body>
<footer>
<p></p>
</footer>
<script type="text/javascript" src="./static.js"></script>
</body>
</html>
When I parse it I get:
$ doc = Nokogiri::HTML(open('myfile.html'))
$ doc.errors
> [#<Nokogiri::XML::SyntaxError: 10:12: ERROR: Tag footer invalid>]
Removing <meta charset="UTF-8">
fixes the problem.
Why? And how can I make it work with the tag in place?
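Nokogiri is primarily an XML parser and thus expects mostly valid XML. Although HTML looks a lot like XML, it has different rules (for example about closing tags) and its own algorithms for detecting things such as the encoding. With HTML 5 in particular, these differences make it incompatible with XML and XML parsers. The "Tag footer invalid" message itself arises because footer is an HTML 5 element that the underlying parser does not recognize.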
In an issue related to your problem, Mike Dalessio (one of the Nokogiri maintainers) responded:
Nokogiri does not support HTML5. You may want to check out the Nokogumbo project, which aims for HTML5 compatibility with the Gumbo parser.