MSHTML's IHTMLDocument: incorrect charset after load from URL


I load a web page from the Internet into a windowless IHTMLDocument2 so that I can modify its DOM later. Everything works except the charset: regardless of the charset the page declares in its META tag, the charset property of the document always reads "Windows-1251" immediately after the document is loaded.

When I later write the modified document out, the file is unreadable because of an encoding mismatch: the text is still in the original encoding, whereas the META charset tag in the new document says "Windows-1251".

Here's the code I use to load the document (error handling and cleanup omitted).

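    // Create a windowless MSHTML document object.
    IHTMLDocument2* pDoc = NULL;
    CoCreateInstance(CLSID_HTMLDocument, NULL, CLSCTX_INPROC_SERVER,
            IID_IHTMLDocument2, (void**)&pDoc);

    // Build a URL moniker for the page; path holds the URL as a wide string.
    IMoniker* pIMoniker = NULL;
    CreateURLMonikerEx(NULL, path.c_str(), &pIMoniker, URL_MK_UNIFORM);

    // The document loads itself from the moniker through IPersistMoniker.
    IPersistMoniker* pPMk = NULL;
    pDoc->QueryInterface(IID_IPersistMoniker, (void**)&pPMk);

    IBindCtx* pBCtx = NULL;
    CreateBindCtx(0, &pBCtx);

    // Start the load from the URL.
    pPMk->Load(FALSE, pIMoniker, pBCtx, STGM_READ|STGM_SHARE_EXCLUSIVE);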

Why is the encoding wrong, and how do I make it right? Thanks.


Solution

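  • The problem was resolved by putting the document into design mode before loading:

    // put_designMode takes a BSTR, so allocate one instead of passing a wide literal.
    BSTR bstrOn = SysAllocString(L"On");
    hr = pDoc->put_designMode(bstrOn);
    SysFreeString(bstrOn);

    With this change the charset is reported correctly after the load. (But why?..)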
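
To check whether a written file still shows the mismatch described in the question, the charset declared in its META tag can be compared with the encoding the original page was served in. The following is only a minimal sketch under stated assumptions: `declaredCharset` is a hypothetical helper (it is not part of MSHTML), it assumes the file is in an ASCII-compatible encoding, and it does a plain substring search for `charset=` rather than real HTML parsing.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper (not from the original post): returns the lower-cased
// value of the first "charset=" found in the raw bytes of an HTML file,
// or an empty string if there is none.
std::string declaredCharset(const std::string& html)
{
    // Lower-case a copy so the search is case-insensitive.
    std::string lower(html);
    std::transform(lower.begin(), lower.end(), lower.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    const std::string key = "charset=";
    std::string::size_type pos = lower.find(key);
    if (pos == std::string::npos)
        return std::string();
    pos += key.size();

    // Skip an optional opening quote, as in <meta charset="...">.
    if (pos < lower.size() && (lower[pos] == '"' || lower[pos] == '\''))
        ++pos;

    // The charset name ends at a quote, whitespace, ';', '/' or '>'.
    std::string::size_type end = lower.find_first_of("\"' \t\r\n;/>", pos);
    if (end == std::string::npos)
        end = lower.size();
    return lower.substr(pos, end - pos);
}
```

Reading the saved file into a `std::string` and passing it to this function gives the declared charset in lower case; if it returns "windows-1251" for a page that was originally UTF-8, the document was saved with the wrong charset property.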