javascript ecmascript-6 jquery-selectors createelement

document.createElement without loading in DOM


Is there a way to create a parsable DOM without running the code? I'll explain further.

I receive a whole bunch of CKEditor-generated code as HTML, but want to parse elements from it to create a specific view. For example, I'd like to grab the first paragraph as an intro and the first image as a primary image. In addition, I want to retrieve all images to create a gallery.

To do this, I've created a simple but effective function:

export const getFromContent = (html, qsa) => {
    const elm = document.createElement("DIV");
    elm.innerHTML = html;
    let r = elm.querySelectorAll(qsa);
    return r;
}

This works almost perfectly. The only issue is that it adds everything to the DOM (do I use the term correctly?), which means that all resources get loaded even if they are not shown on the page.

In my example I would like to load all images through //res.cloudinary.com/ to compress them before they are shown, but since all the images have already been loaded by then, that no longer helps.

Is there a way to keep the good parts of this approach with just basic JS?

PS: I know that I could rewrite all "src" to "presrc" with a regex, but I would really like to do this without changing the code and thereby creating room for errors.

Best regards Richard


Solution

  • This works almost perfectly. The only issue is that it adds everything to the DOM...

    It creates DOM elements (that's why you're doing it!😀), but it doesn't add them to the window's document. Be aware, though, that doing that can run code that's in the HTML (details below). Doing that:

    • Won't put the div or its contents anywhere in the document.
    • Won't load any stylesheets defined by link tags in the HTML.
    • Won't fetch any script files referenced by script src="xyz" tags in the HTML (and thus won't run the code).
    • Won't run any code in inline script tags in the HTML.
    • Will add event handlers defined via onXyz attributes on elements in the HTML.
    • Will load any images defined in the HTML.

    It's the combination of those last two points that means it can run arbitrary code, like this:

    const getFromContent = (html, qsa) => {
        const elm = document.createElement("DIV");
        elm.innerHTML = html;
        let r = elm.querySelectorAll(qsa);
        return r;
    };
    
    getFromContent(
        `<img
            src="http://example.com/alksdjflsadkf"
            onload="console.log('Arbitrary code ran!');"
            onerror="console.log('Arbitrary code ran!');"
        >`,
        "p"
    );

    If you later add that div to the document, that will load any stylesheets defined by link tags in the HTML but won't run code in script elements (either inline or referenced via src).

    That said, you might want to look at using DOMParser instead.

    export const getFromContent = (html, selector) => {
        const parser = new DOMParser();
        const dom = parser.parseFromString(html, "text/html");
        let r = dom.querySelectorAll(selector);
        return r;
    };
    

    It won't even load images defined in the HTML, and although it adds handlers defined via inline onXyz attributes, it doesn't fire any events, so those handlers won't be run — until or unless you add the resulting document's contents to an active document. It just parses the tree and returns a document.
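Applied to the goal in the question (an intro paragraph, a primary image and a gallery), the same inert parse could look something like the sketch below. `extractView` and `toCloudinaryFetchUrl` are hypothetical names, `"demo"` is a placeholder cloud name, and the URL shape assumes Cloudinary's fetch delivery type with the `f_auto,q_auto` transformation, so adjust it to your own account settings.

```javascript
// Hypothetical helper: wraps an image URL in a Cloudinary "fetch" delivery URL.
// Assumption: `src` is an absolute URL without a query string; URLs containing
// characters such as "?" or "=" would need escaping first.
const toCloudinaryFetchUrl = (src, cloudName = "demo", transforms = "f_auto,q_auto") =>
    `https://res.cloudinary.com/${cloudName}/image/fetch/${transforms}/${src}`;

// Browser-only: DOMParser builds an inert document, so the original image
// URLs are read as plain attribute values and never requested.
const extractView = (html, cloudName = "demo") => {
    const doc = new DOMParser().parseFromString(html, "text/html");
    const firstParagraph = doc.querySelector("p");
    const gallery = Array.from(doc.querySelectorAll("img[src]"), (img) =>
        toCloudinaryFetchUrl(img.getAttribute("src"), cloudName)
    );
    return {
        intro: firstParagraph ? firstParagraph.textContent.trim() : "",
        primaryImage: gallery[0] ?? null, // null when the content has no images
        gallery,
    };
};
```

The result only holds strings, so nothing from the editor's HTML is inserted into the page; you build your own `img` elements from the rewritten URLs.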

    Live Example:

    const getFromContent = (html, selector) => {
        const parser = new DOMParser();
        const dom = parser.parseFromString(html, "text/html");
        let r = dom.querySelectorAll(selector);
        return r;
    };
    
    const content = `
        <p>Paragraph 1</p>
        <p>
            Paragraph 2
            <img
                src="http://example.com/alksdjflsadkf"
                onload="console.log('Arbitrary code ran!');"
                onerror="console.log('Arbitrary code ran!');"
            >
        </p>
        <div>Div 1</div>
        <p>Paragraph 3</p>
    `;
    const paragraphs = getFromContent(content, "p");
    console.log(paragraphs.length);
    for (const paragraph of paragraphs) {
        console.log(paragraph.textContent);
    }

    Note that the img's inline handlers were never fired. They would be if you added the img elements (or their ancestors) to an active document.
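A minimal sketch of that last case, for a browser page. `attachFirstImage` is a hypothetical name and the `target` parameter (defaulting to `document.body`) is an assumption made for illustration. Once the image is moved into the active document it should start loading, at which point its inline `onload` or `onerror` handler can run.

```javascript
// Browser-only sketch: parse inertly, then move the first <img> into the page.
const attachFirstImage = (html, target = document.body) => {
    const parsed = new DOMParser().parseFromString(html, "text/html");
    const img = parsed.querySelector("img");
    if (img) {
        // Appending adopts the node into the active document. From here on
        // the browser treats it like any other image on the page, including
        // any inline handlers that came with the HTML.
        target.append(img);
    }
    return img; // null when the HTML contains no <img>
};
```

Calling it with the `content` string from the example above is where the "Arbitrary code ran!" handler would be expected to log.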


    Note: When accepting user input and rendering it as HTML, it's often important to sanitize that input before using it, to remove unwanted content. For instance, I mentioned that script elements wouldn't be evaluated, and that's true, but if the content had <img src="x" onerror="doSomethingNefarious()"> in it, and you appended that image to a document (directly or indirectly), that doSomethingNefarious() code would be executed as soon as the image failed to load. Similarly, <div onclick="doSomethingNefarious()">x</div> would run it on click, as would a link such as <a href="javascript:doSomethingNefarious()">x</a>.

    If you search for "HTML sanitizer" you'll find a lot of different libraries out there that say they'll do it for you. The problem is significant enough, though, that a means of doing it is in the process of being standardized as the HTML Sanitizer API. Early days yet, but it's a very promising development. With the API in its current (very draft) form, you could do:

    export const getFromContent = (html, selector) => {
        const div = document.createElement("div");
        div.setHTML(html);  // <== `setHTML` is a new method that sanitizes.
                            // Here I'm using the default sanitizer, but you
                            // could create one with your own custom settings
                            // and pass it via the second argument (an options
                            // object in more recent drafts)
        let r = div.querySelectorAll(selector);
        return r;
    };
    

    But the API is still in flux.
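Because support for `setHTML` still varies, a feature check is one way to use it only where it exists. In this sketch, `getSanitizedContent` is a hypothetical name and the `doc` parameter is an addition (defaulting to the page's `document`) that just makes the function easier to exercise. Returning `null` when the method is missing leaves the choice of fallback, such as a sanitizer library, to the caller.

```javascript
// Sketch: use Element.setHTML when available, otherwise signal "unsupported".
const getSanitizedContent = (html, selector, doc = document) => {
    const div = doc.createElement("div");
    if (typeof div.setHTML !== "function") {
        return null; // no built-in sanitizer here; the caller picks a fallback
    }
    div.setHTML(html); // default sanitizer configuration
    return div.querySelectorAll(selector);
};
```

A caller might then write `getSanitizedContent(html, "p") ?? getFromContent(html, "p")`, keeping in mind that the DOMParser version is inert but not sanitized.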