Search code examples
c++htmlhtml-parsingmshtmlmicrosoft.mshtml

In HTML parsing Get Attributes of a tag in Cpp using IHTMLDOMAttribute


please help i am doing a html parsing using MSHTML. My code for getting all attributes of a particular tag is like this

void GetAttributes(MSHTML::IHTMLElementPtr pColumnInnerElement)
{
    IHTMLDOMNode *pElemDN = NULL;
    LONG lACLength;
    MSHTML::IHTMLAttributeCollection *pAttrColl;
    IDispatch* pACDisp;
    VARIANT vACIndex;
    IDispatch* pItemDisp;
    IHTMLDOMAttribute* pItem;
    BSTR bstrName;
    VARIANT vValue;
    VARIANT_BOOL vbSpecified;
    pColumnInnerElement->QueryInterface(IID_IHTMLDOMNode, (void**)&pElemDN);
    if (pElemDN != NULL)
    {
        pElemDN->get_attributes(&pACDisp);
        pACDisp->QueryInterface(IID_IHTMLAttributeCollection, (void**)&pAttrColl);
        pAttrColl->get_length(&lACLength);
        vACIndex.vt = VT_I4;
        for (int i = 0; i < lACLength; i++)
        {

            vACIndex.lVal = i;
            pItemDisp = pAttrColl->item(&vACIndex);
            if (pItemDisp != NULL)
            {
               pItemDisp->QueryInterface(IID_IHTMLDOMAttribute, (void**)&pItem);
               pItem->get_specified(&vbSpecified);
               pItem->get_nodeName(&bstrName);
               pItem->get_nodeValue(&vValue);

               if (vbSpecified)
                cout<<_com_util::ConvertBSTRToString(bstrName)<<" :"<<_com_util::ConvertBSTRToString(vValue.bstrVal)<<endl;
               pItem->Release();
            }
            pItemDisp->Release();

        }
        pElemDN->Release();
        pACDisp->Release();
        pAttrColl->Release();
    }
}

The problem is for given tag <input id="Switch l_id2" class="pointer" name="Switch" onclick='SetControl("Switch l",1)' type="button" value="OK"> it prints all attributes except value attribute. The get_specified function is returning false for value attribute.

My output is

id :Switch l_id2
class :pointer
onclick :SetControl("Switch l",1)
type :button
name :Switch

Any idea why? Also which other attributes may have this problem??

Note

I tried like this. Its showing the correct attribute results for value.

        if (strcmp(_com_util::ConvertBSTRToString(bstrName), "value") == 0)
        {
            cout<<_com_util::ConvertBSTRToString(bstrName)<<" :"<<_com_util::ConvertBSTRToString(vValue.bstrVal)<<endl;
        }

Solution

  • If you are working in managed(CLI) VC++ then you can consider the HTML Agility Pack, available via nuget.

    If sticking to MSHTML is not necessary then probably you can opt for parsing the HTML documents as XML documents. That way you would be able to parse all the tags and attributes with a lot of flexibility. There are plenty of XML parsers available for C++.

    This library looks compact simple and efficient (available for multiple platforms): https://github.com/leethomason/tinyxml2

    Another one is: http://pugixml.org/

    This link may help you if you want to get rid of MSHTML dependency: http://www.codeproject.com/Articles/30342/Remove-Microsoft-mshtml-dependency