Security issue: Parsing HTML with XMLHttpRequest?

Main Question

Is parsing HTML files with XMLHttpRequest, using responseType = "document", a potential security issue?

Examples can be found on MDN here: HTML in XMLHttpRequest

When setting responseType = "document", XMLHttpRequest will try to parse the response from the URL (an HTML file in our case) into DOM nodes and return the resulting document.
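As a rough sketch of the pattern described above, the snippet below requests an HTML file and asks the browser to hand back a parsed Document. The function name `loadHtmlDocument`, its callback parameters, and the injectable `XhrCtor` argument are illustrative assumptions made for this example; only `XMLHttpRequest`, `responseType`, `status` and `responseXML` are standard.

```javascript
// Sketch only: loadHtmlDocument, onDocument, onError and XhrCtor are
// hypothetical names chosen for this example.
function loadHtmlDocument(url, onDocument, onError, XhrCtor = globalThis.XMLHttpRequest) {
  const xhr = new XhrCtor();
  // Asynchronous by default; responseType cannot be set for synchronous
  // requests made from a page.
  xhr.open("GET", url);
  // Ask the browser to parse the response body into a Document.
  xhr.responseType = "document";
  xhr.onload = () => {
    const ok = xhr.status >= 200 && xhr.status < 300;
    // responseXML is null when the body could not be parsed as a document.
    if (ok && xhr.responseXML) {
      onDocument(xhr.responseXML);
    } else {
      onError(new Error("Could not load document, status " + xhr.status));
    }
  };
  xhr.onerror = () => onError(new Error("Network error"));
  xhr.send();
}
```

In a browser this might be called as `loadHtmlDocument("/fragment.html", doc => console.log(doc.title), console.error)`, where the path is a placeholder. Per the XMLHttpRequest specification the response is parsed with scripting disabled, so the exposure is likely to come from what the page does with the returned nodes afterwards, for example inserting them into the live DOM.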

Let's say we have a man-in-the-middle attack situation (i.e. we are not using HTTPS), and the HTML file is swapped out. Are we at risk?

Bonus Question

Let's say we are loading a JSON file instead of an HTML file. Is using responseType = "text" as safe as JSON.parse, i.e. the code is not evaluated?


  • I am not a developer, but a security practitioner, so please excuse any inaccuracies. The short answer from my side is yes: whenever you fetch and interpret external data there will be security risks. This applies not only to HTML, but also to parsing XML, or to including any form of content that goes through an interpreter. For example, in an AJAX application the code handling the XMLHttpRequest result may perform some action on behalf of the user. If the file is swapped out, an attacker could trigger such an action.

    When building an application you will not be able to eliminate all risk, but you want to bring it down to acceptable levels. For example, instead of including external code, host the code yourself.

    This also applies to your XMLHttpRequest fetch: where does the data come from? More risk comes with third parties and with cross-domain requests, so avoid them if you can. You should consider restricting cross-origin resource sharing by policy, through Access-Control-Allow-Origin. HTTPS does not eliminate the risk either, as you may not be able to trust the third party anyway, and HTTPS does not completely eliminate MITM attacks.

    If however you are fetching something that you are hosting yourself and have a trusted channel to obtain, you may argue that the remaining risk is small.

    As for the bonus question, I am not sure whether this will make a big difference. I assume that with responseType = "text" you will end up with a long string of text which is actually an HTML document. Then what? If you still plan to parse it, scripts may run. JSON.parse is a text parser, which will not load scripts, but as far as I can understand you need to expose yourself to the parsing of HTML anyway. The solution is probably to make sure you can trust the source.
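For the JSON case in the bonus question, a minimal sketch of the point that JSON.parse treats its input purely as data might look like this. The helper name `parseJsonText` and the `tampered` payload are made up for illustration; the only standard call used is `JSON.parse`.

```javascript
// Sketch only: parseJsonText is a hypothetical helper, not a library API.
// It parses a text response body as JSON and never evaluates it as code.
function parseJsonText(text) {
  try {
    return { ok: true, value: JSON.parse(text) };
  } catch (err) {
    // Anything that is not valid JSON (including JavaScript statements)
    // ends up here as a SyntaxError instead of being run.
    return { ok: false, error: err.message };
  }
}

// A swapped-out payload that tries to smuggle in code alongside the data.
const tampered = 'globalThis.pwned = true; ({"a": 1})';
const tamperedResult = parseJsonText(tampered);

// A well-formed payload is returned as a plain object.
const goodResult = parseJsonText('{"a": 1}');
```

Here `tamperedResult.ok` is false and `globalThis.pwned` is never set, while `goodResult.value.a` is 1. With responseType = "text", the string passed in would be `xhr.responseText`. The parsed values are still untrusted data: a string field containing markup only becomes a concern if the page later inserts it into the DOM as HTML.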