html, dom, browser, xss, html-encode

Basic encoding/decoding of characters for the web


I feel like this is something I should definitely know, but I'm not sure exactly when a browser decodes a character (or even whether I'm thinking about it in the right way).

While inspecting the DOM of a site to which I've added some content (through a form, for example), I can see the < in my comment appear as plain text. Even if the angle brackets are well balanced (e.g. <something>), it appears as text rather than as an element in the DOM. I appreciate this is critical in defending against injection attacks such as XSS, so the server writes the content as a string literal rather than as an element. But how does the browser recognise this and render it differently? And when does it decode it?

If the server does respond with &gt; or &lt;, why do I not see this in dev tools?

My confusion comes from the fact that, when inspecting, there is no difference between my <something> content and a <something> element (if there were such a thing).


Solution

  • So, I'd expect to see (when inspecting the DOM) &lt;content&gt;, but it seems not.

    This is merely because your browser's DOM inspector is a bit loose in its representation. You're inspecting the DOM after all, a complex object-oriented in-memory structure, yet your browser is showing it to you in an HTML-like presentation. The HTML parser decodes character references such as &lt; while it builds the DOM, so the resulting text node holds a literal < character. Either because of an oversight or as a conscious decision to make this presentation more readable, not everything that would have to be an HTML entity in valid HTML is displayed as an HTML entity.

    If you inspect the actual source code of the page, you'll see &lt;content&gt;.
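
To make the difference between the page source and the parsed result concrete, here is a minimal sketch in Python. It is not what a browser actually runs: Python's `html.parser` only stands in for the browser's HTML parser, and `NodeLogger`, `parse` and `user_comment` are hypothetical names made up for this illustration. The escaped source carries `&lt;something&gt;`, but after parsing it is a text node containing `<something>`, whereas the unescaped source yields an element.

```python
from html import escape
from html.parser import HTMLParser


class NodeLogger(HTMLParser):
    """Records what the parser sees: element start tags vs. decoded text."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode &lt; and friends
        # before handing text to handle_data.
        super().__init__(convert_charrefs=True)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("element", tag))

    def handle_data(self, data):
        self.events.append(("text", data))


def parse(markup):
    """Parse markup and return the list of (kind, value) events."""
    parser = NodeLogger()
    parser.feed(markup)
    parser.close()
    return parser.events


user_comment = "<something>"

# What an escaping server would send (this is what "View Source" shows).
escaped_source = "<p>" + escape(user_comment) + "</p>"

# What a server would send if it did not escape the comment.
raw_source = "<p>" + user_comment + "</p>"

escaped_events = parse(escaped_source)
raw_events = parse(raw_source)

print(escaped_source)   # <p>&lt;something&gt;</p>
print(escaped_events)   # [('element', 'p'), ('text', '<something>')]
print(raw_events)       # [('element', 'p'), ('element', 'something')]
```

The two parsed results differ in kind (text versus element) even though an inspector may print both as `<something>`. That is why the entity is only visible in the raw source.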