Here is my HTML code.
<!DOCTYPE html>
<html>
<meta charset="UTF-8">
<head>
<title>Bar</title>
<script>
window.onload = function() {
console.log(document.body.innerHTML)
}
</script>
</head>
<body>
<http://www.example.com/foo/bar/baz.html>
</body>
</html>
I save this code in a file named bar.html
and then open the page with Firefox or Chrome. This is the output I see in the console.
<http: www.example.com="" foo="" bar="" baz.html="">
</http:>
Now I understand that my code was incorrect because it had a URL enclosed within < and >.
I want to understand exactly how the browser came to parse it as an http: tag, with parts of the URL interpreted as HTML attributes.
Is there some part of the HTML specification that leads to this kind of behavior? If so, could you please quote such parts of the HTML specification?
Everything you need to know is in section 8.2.4 (Tokenization) of the specification. In particular:
1. Up to <http: the parser is in the tag name state. The element's tag name is http: (including the colon), as evidenced by the </http:> end tag.
2. The first / switches the parser to the self-closing start tag state.
3. The second / causes a parse error, as described for that state, and switches the parser to the before attribute name state.
4. The parser enters the attribute name state and continues consuming the URL. This is what causes parts of the path to be treated as attribute names.
5. When the parser reaches the next /, it switches back to the self-closing start tag state and repeats steps 2 and 3, except that this time it is not a second / but a different character (one that isn't >) that causes the parse error and switches the parser back to the before attribute name state.
6. Once the parser finally sees a >, it closes the start tag, emits it, and proceeds as normal.
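The state transitions described above can be approximated with a small state machine. This is only a rough sketch, not the specification's algorithm: the function name tokenizeStartTag and the state labels are hypothetical, and it assumes valueless attributes only (no =, no quoted values, no duplicate-attribute handling, no character references). It is meant to show why each / ends one attribute name and starts another.

```javascript
// Simplified sketch of the start-tag portion of an HTML tokenizer.
// Hypothetical helper: handles only the tag name and valueless attributes.
function tokenizeStartTag(input) {
  if (input[0] !== "<") throw new Error("expected '<'");
  let state = "tagName";
  let tagName = "";
  const attributes = [];
  let parseErrors = 0;
  let selfClosing = false;
  let i = 1;

  while (i < input.length) {
    const ch = input[i];
    switch (state) {
      case "tagName":
        if (ch === "/") state = "selfClosingStartTag";
        else if (ch === ">") state = "done";
        else if (/\s/.test(ch)) state = "beforeAttributeName";
        else tagName += ch.toLowerCase(); // ':' is kept as part of the name
        i++;
        break;
      case "selfClosingStartTag":
        if (ch === ">") { selfClosing = true; state = "done"; i++; }
        else {
          // Anything other than '>' is a parse error; the character is
          // reconsumed (i is not advanced) in the before attribute name state.
          parseErrors++;
          state = "beforeAttributeName";
        }
        break;
      case "beforeAttributeName":
        if (ch === "/") { state = "selfClosingStartTag"; i++; }
        else if (ch === ">") { state = "done"; i++; }
        else if (/\s/.test(ch)) { i++; }
        else { attributes.push(""); state = "attributeName"; } // reconsume
        break;
      case "attributeName":
        if (ch === "/") { state = "selfClosingStartTag"; i++; }
        else if (ch === ">") { state = "done"; i++; }
        else if (/\s/.test(ch)) { state = "beforeAttributeName"; i++; }
        else { attributes[attributes.length - 1] += ch.toLowerCase(); i++; }
        break;
    }
    if (state === "done") break;
  }
  return { tagName, attributes, selfClosing, parseErrors };
}

// Usage: tokenize the tag from the question and serialize it the way
// innerHTML does, with each valueless attribute written as name="".
const token = tokenizeStartTag("<http://www.example.com/foo/bar/baz.html>");
const serialized =
  "<" + token.tagName + token.attributes.map((a) => ` ${a}=""`).join("") + ">";
console.log(serialized);
// prints: <http: www.example.com="" foo="" bar="" baz.html="">
```

Under these assumptions the sketch yields the same start tag that appears in the console output from the question. The closing </http:> in that output is not produced by the tokenizer; it appears because the resulting element is closed when the tree is built and is then serialized with an end tag.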