Can I parse an HTML file using an XML parser?
Why can (or can't) I do this? I know that XML is used to store data and that HTML is used to display data, but syntactically they are almost identical.
The intended use is to make an HTML parser, that is part of a web crawler application.
You can try parsing an HTML file using an XML parser, but it’s likely to fail. The reason is that HTML documents can use the following HTML features, which XML parsers don’t understand:
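- Elements that never have end tags and that don't use XML's self-closing tag syntax; e.g., <br>, <meta>, <link>, and <img> (also known as void elements).
- Elements that don't need end tags; e.g., <p>, <dt>, and <li> (their end tags can be implied).
- Elements that can contain unescaped "<" characters; e.g., style, textarea, title, script; <script> if (a < b) … </script>, <title>Using the "<" operator</title>.
- Attributes with unquoted values; e.g., <meta charset=utf-8>.
- Attributes that are empty, with no value given at all; e.g., <input disabled>.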
XML parsers will fail to parse any HTML document that uses any of those features.
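To see those failures concretely, here is a minimal sketch that uses Python's standard-library XML parser, xml.etree.ElementTree, as a stand-in for "an XML parser". The snippets and the helper name xml_parse_error are made up for this example, one snippet per feature listed above; other XML parsers word their errors differently but should reject the same inputs.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample inputs, one per HTML feature discussed above.
HTML_SNIPPETS = {
    "void element": "<p>line one<br>line two</p>",
    "implied end tag": "<ul><li>one<li>two</ul>",
    'unescaped "<"': "<script> if (a < b) { } </script>",
    "unquoted attribute": "<meta charset=utf-8 />",
    "empty attribute": "<input disabled />",
}


def xml_parse_error(markup):
    """Return the XML parser's error message, or None if the markup is well-formed."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    for feature, markup in HTML_SNIPPETS.items():
        print(f"{feature}: {xml_parse_error(markup)}")
```

The same content written as well-formed XML, for example <p>line one<br/>line two</p> or <meta charset="utf-8" />, goes through without an error, which is why XHTML-style markup can be fed to an XML parser while typical HTML cannot.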
HTML parsers, on the other hand, will basically never fail no matter what a document contains, because the HTML standard defines how to recover from every kind of malformed markup.
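For contrast, here is a rough sketch of that lenient behaviour using Python's standard-library html.parser module. One assumption to flag: html.parser is a forgiving tokenizer, not an implementation of the HTML standard's parsing algorithm, so this only illustrates that such input is accepted rather than rejected. The class name TagLogger, the helper log_tags, and the sample markup are made up for this example.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Hypothetical helper: records start tags, their attributes, and text."""

    def __init__(self):
        super().__init__()
        self.start_tags = []  # (tag, attrs) pairs in document order
        self.text = []        # text chunks, including script content

    def handle_starttag(self, tag, attrs):
        self.start_tags.append((tag, attrs))

    def handle_data(self, data):
        self.text.append(data)


# Made-up markup combining the features that an XML parser rejects.
SAMPLE = (
    "<meta charset=utf-8>"
    "<p>one<br>two"
    "<ul><li>a<li>b</ul>"
    "<input disabled>"
    "<script> if (a < b) { } </script>"
)


def log_tags(markup):
    """Feed the markup to the lenient parser and return the populated logger."""
    parser = TagLogger()
    parser.feed(markup)
    parser.close()
    return parser


if __name__ == "__main__":
    result = log_tags(SAMPLE)
    for tag, attrs in result.start_tags:
        print(tag, attrs)
```

In this sketch the unquoted attribute is reported as ("charset", "utf-8") and the empty attribute as ("disabled", None), and the "<" inside the script element is passed through as text instead of being treated as the start of a tag.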
All that said, there’s also been work toward a new type of XML parsing: so-called XML5 parsing, capable of handling things like empty or unquoted attributes even in XML documents. There is a draft XML5 specification, as well as an XML5 parser, xml5ever.
You wrote: "The intended use is to make an HTML parser, that is part of a web crawler application."
If you’re going to create a web-crawler application, you should absolutely use an HTML parser, ideally one that conforms to the parsing requirements in the HTML standard.
These days, there are such conformant HTML parsers for many (or even most) languages; e.g.: