c++ delphi dom mshtml

Removing invalid URLs from MSHTML document before it is loaded


I use MSHTML (IHTMLDocument) to display offline HTML that comes from emails and may contain various links.

Some of the links have URLs starting with // or /, for example:

<img src="//www.example.com/image.jpg">

<img src="/www.example.com/image.jpg">

The document then takes a long time to load and display, because MSHTML tries to resolve these URLs and cannot find them, as they do not start with http:// or https://.

I tried injecting a <base> tag into <head> that points to a known local folder (which is empty), and that stopped the problem. For example:

<base href="C:\myemptypath\">
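A minimal sketch of how such an injection could look, assuming the HTML is available as a std::string before it is handed to MSHTML. InjectBaseTag is a hypothetical helper written for illustration; it is not part of MSHTML and does only a simple case-insensitive search for the opening <head> tag:

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: inserts <base href="..."> right after the opening
// <head ...> tag. Returns the input unchanged if no <head> tag is found.
std::string InjectBaseTag(const std::string& html, const std::string& baseHref)
{
    // Lower-case copy for a case-insensitive search; offsets match the original.
    std::string lower(html);
    std::transform(lower.begin(), lower.end(), lower.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    std::string::size_type pos = 0;
    while ((pos = lower.find("<head", pos)) != std::string::npos)
    {
        const std::string::size_type next = pos + 5;

        // Accept "<head>" or "<head attr...>", but skip "<header>" and similar.
        if (next < lower.size() &&
            (lower[next] == '>' || std::isspace(static_cast<unsigned char>(lower[next]))))
        {
            const std::string::size_type close = lower.find('>', next);
            if (close == std::string::npos)
                break;

            std::string result(html);
            result.insert(close + 1, "<base href=\"" + baseHref + "\">");
            return result;
        }
        pos = next;
    }
    return html;
}
```

Calling InjectBaseTag(html, "C:\\myemptypath\\") on the raw email HTML would produce the <base> line shown above directly after the opening <head> tag.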

However, if a link begins with \\ (a UNC path), the same problem and the long loading time come back. For example:

<img src="\\www.something.com\image.jpg">

I also tried putting the WebBrowser control into "offline" mode and every other trick I could think of. The only things I could come up with were using RegEx to replace all the links in the HTML, which would be a terribly slow solution, or parsing the HTML myself, which defeats the purpose of MSHTML.

Is there a way to:

  • Detect these invalid URLs before the document is loaded? Note: I already walk the DOM (e.g. the WebBrowser1.Document.body.all collection) to collect the links from all tags and modify them, and that works. However, it only happens after the document has loaded, so the long wait before loading gives up still occurs.

  • Trigger some event that lets me skip loading these invalid links and replace them with about:blank or an empty string "", something like an "OnURLPreview" event in which I could inspect each URL and reject the invalid ones? The only event I found is OnDownloadBegin, which does not do this.

Examples in any language are welcome. I use Delphi and C++ (C++ Builder), but I only need the principle here, that is, which direction to look in.


Solution

  • After a long time, this is the solution I ended up using:

    Create a standalone instance of CLSID_HTMLDocument to parse the HTML:

    DelphiInterface<IHTMLDocument2> diDoc;
    OleCheck(CoCreateInstance(CLSID_HTMLDocument, NULL, CLSCTX_INPROC_SERVER, IID_PPV_ARGS(&diDoc)));
    

    Write the HTML into diDoc using IHTMLDocument2::write:

    WideString HTML = "Example HTML here...";

    // Create a one-dimensional SAFEARRAY with a single VARIANT element
    SAFEARRAY *psaStrings = SafeArrayCreateVector(VT_VARIANT, 0, 1);
    
    if (psaStrings)
        {
        VARIANT *param = NULL;
        HRESULT hr = SafeArrayAccessData(psaStrings, (LPVOID*)&param);
        if (SUCCEEDED(hr))
            {
            // The array owns this BSTR copy from here on
            param->vt      = VT_BSTR;
            param->bstrVal = SysAllocString(HTML.c_bstr());
            SafeArrayUnaccessData(psaStrings);
    
            hr = diDoc->write(psaStrings);
            diDoc->close();
            }
    
        // SafeArrayDestroy clears each VARIANT element, which frees the BSTR,
        // so a separate SysFreeString call is not needed
        SafeArrayDestroy(psaStrings);
    
        // Assumes the enclosing function returns HRESULT
        return hr;
        }
    

    Walk the elements in diDoc and fix or remove the unwanted links:

    DelphiInterface<IHTMLElementCollection> diCol;
    if (SUCCEEDED(diDoc->get_all(&diCol)) && diCol)
        {
        // Parse IHTMLElementCollection here...
        }
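The check applied to each link inside that loop is not shown above. A minimal sketch of one possible check, assuming that any URL whose first non-blank character is a slash or backslash (the //, / and \\ forms from the question) counts as unwanted. IsUnwantedUrl and SanitizeUrl are hypothetical helper names, and about:blank is just one possible replacement:

```cpp
#include <algorithm>
#include <cwctype>
#include <string>

// Hypothetical predicate: true for the URL forms described in the question
// ("//host/...", "/path" and "\\host\..."). Assumption: any URL whose first
// non-blank character is '/' or '\' is treated as unwanted.
bool IsUnwantedUrl(const std::wstring& rawUrl)
{
    // Skip leading whitespace, which attribute values may contain.
    const auto first = std::find_if(rawUrl.begin(), rawUrl.end(),
                                    [](wchar_t c) { return !std::iswspace(c); });
    if (first == rawUrl.end())
        return false; // empty or blank value, nothing to resolve

    return *first == L'/' || *first == L'\\';
}

// Hypothetical replacement policy: unwanted URLs become about:blank,
// everything else is returned unchanged.
std::wstring SanitizeUrl(const std::wstring& url)
{
    return IsUnwantedUrl(url) ? std::wstring(L"about:blank") : url;
}
```

Inside the loop, each src or href value would be passed through SanitizeUrl and written back with whatever accessor you already use (for example IHTMLElement::getAttribute and setAttribute). URLs such as https://... or cid:... pass through untouched.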
    

    Extract the cleaned HTML into a WideString and write it into the TWebBrowser document:

    DelphiInterface<IHTMLElement> diBODY;
    OleCheck(diDoc->get_body(&diBODY));
    
    if (diBODY)
        {
        DelphiInterface<IHTMLElement> diHTML;
        OleCheck(diBODY->get_parentElement(&diHTML));
    
        if (diHTML)
            {
            WideString wsHTML;
            OleCheck(diHTML->get_outerHTML(&wsHTML));
    
            // Finally, write wsHTML into the TWebBrowser document using the same write/close sequence as above...
            }
        }