Essential Objects WebView: how to navigate through the HTML tree?


I am using the Essential Objects library to read data from websites.

I've done that before with the Windows Forms WebBrowser control, but this time the website does not work with it, so I had to switch to the EO WebView.

The documentation is so sparse that I can't find an answer.

In the Windows Forms WebBrowser you have an HtmlElementCollection, which is essentially a list of HtmlElement objects. On these elements you can read attributes, call InvokeMember("Click"), and navigate to child and parent elements.

What is the equivalent of HtmlElementCollection / HtmlElement in the EO WebView? How can I navigate through the HTML tree?

BTW: I am using it together with C#.


Solution

  • See the documentation: here, here, here.

    Essentially, you have to rely on the ability to execute JavaScript.

    You can access the document JavaScript object in a couple of ways:

    JSObject document = (JSObject)_webView.EvalScript("document");
    
    //or: Document document = _webView.GetDOMWindow().document;   
    

    The document property of GetDOMWindow() is an EO.WebBrowser.DOM.Document instance; that type derives from JSObject and offers some extra properties (e.g., there's a body property that gets you the BODY element of type EO.WebBrowser.DOM.Element).
    Overall, though, the API these types offer is not much richer than that of JSObject itself.

    You can use JSObject like this:

    // access a property on the JavaScript object:
    jsObj["children"]    
    
    // access an element of an array-like JavaScript object:
    var children = (JSObject)jsObj["children"];
    var first = (JSObject)children[0];
    
    // (note that you have to cast; all these have the `object` return type)
    
    // access an attribute on the associated DOM element
    jsObj.InvokeFunction("getAttribute", "class")
    
    // etc.
    

    It's all a bit fiddly, but you can write some extension methods to make your life easier (however, see the note on performance below):

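    using System.Collections.Generic;
    using EO.WebBrowser;
    
    public static class JSObjectExtensions
    {
        public static string GetTagName(this JSObject jsObj)
        {
            // Invariant casing avoids culture-specific surprises (e.g. the Turkish "i").
            return (jsObj["tagName"] as string ?? string.Empty).ToUpperInvariant();
        }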
    
        public static string GetID(this JSObject jsObj)
        {
            return jsObj["id"] as string ?? string.Empty;
        }
    
        public static string GetAttribute(this JSObject jsObj, string attribute)
        {
            return jsObj.InvokeFunction("getAttribute", attribute) as string ?? string.Empty;
        }
    
        public static JSObject GetParent(this JSObject jsObj)
        {
            return jsObj["parentElement"] as JSObject;
        }
    
        public static IEnumerable<JSObject> GetChildren(this JSObject jsObj)
        {
            var childrenCollection = (JSObject)jsObj["children"];
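            // Convert instead of unboxing: the numeric type handed back may not be a boxed int.
            int childObjectCount = System.Convert.ToInt32(childrenCollection["length"]);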
            for (int i = 0; i < childObjectCount; i++)
            {
                yield return (JSObject)childrenCollection[i];
            }
        }
    
        // Add a few more if necessary
    }
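    
    The question also asks for an equivalent of InvokeMember("Click"). A minimal sketch in the same style, assuming the element exposes the standard DOM click() and querySelector() methods and that InvokeFunction can be called without extra arguments; the class JSObjectActionExtensions and both method names are made up for illustration and are not part of the EO API:

    ```csharp
    using EO.WebBrowser;

    public static class JSObjectActionExtensions
    {
        // Hypothetical helper: calls the standard DOM click() method on the element.
        public static void Click(this JSObject jsObj)
        {
            jsObj.InvokeFunction("click");
        }

        // Hypothetical helper: first descendant matching a CSS selector.
        // Assumes a JavaScript null result does not come back as a JSObject.
        public static JSObject QuerySelector(this JSObject jsObj, string selector)
        {
            return jsObj.InvokeFunction("querySelector", selector) as JSObject;
        }
    }
    ```

    With these in place, something like document.QuerySelector("#submit")?.Click(); would click the first element with the (hypothetical) id submit, if there is one.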
    

    Then you can do something like this:

    private void TraverseElementTree(JSObject root, Action<JSObject> action)
    {
        action(root);
        foreach(var child in root.GetChildren())
            TraverseElementTree(child, action);
    }
    

    Here's an example of how you could use this method:

    TraverseElementTree(document, (currentElement) =>
    {
        string tagName = currentElement.GetTagName();
        string id = currentElement.GetID();
        if (tagName == "TD" && id.StartsWith("codetab"))
        {
            string elementClass = currentElement.GetAttribute("class");
            // do something...
        } 
    });
    

    But, again, it's a bit fiddly. While this seems to work reasonably well, you'll need to experiment to find the tricky cases that cause errors, and then adjust the approach to make it more stable.

    Note on performance

    An alternative is to do most of the element processing in JavaScript and return only the values you need to your C# code. Depending on how complex the logic is, this is likely to be more efficient in some scenarios, because it takes a single round trip to the browser engine, so it's worth considering if performance becomes an issue. (See the Performance section here.)
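
    As a rough sketch of that idea, the TD/"codetab" example from above could be moved into a single script that returns its results as one JSON string. This assumes EvalScript hands a JavaScript string back as a .NET string and that System.Text.Json is available in your project; ScriptedQueries and GetCodeTabClasses are hypothetical names chosen for illustration:

    ```csharp
    using System.Collections.Generic;
    using System.Text.Json;
    using EO.WebBrowser;

    public static class ScriptedQueries
    {
        // JavaScript evaluated inside the page: collects the class attribute of every
        // TD whose id starts with "codetab" and returns the list as one JSON string.
        private const string CollectClassesScript = @"
            (function () {
                var cells = document.querySelectorAll('td[id^=""codetab""]');
                var classes = [];
                for (var i = 0; i < cells.length; i++) {
                    classes.push(cells[i].getAttribute('class') || '');
                }
                return JSON.stringify(classes);
            })()";

        public static List<string> GetCodeTabClasses(WebView webView)
        {
            // Assumption: a JavaScript string result arrives as a .NET string.
            var json = webView.EvalScript(CollectClassesScript) as string;
            if (string.IsNullOrEmpty(json))
                return new List<string>();

            return JsonSerializer.Deserialize<List<string>>(json) ?? new List<string>();
        }
    }
    ```

    The selection and the attribute reads all happen inside the page, so only one call crosses the C#/JavaScript boundary, instead of one call per property or attribute as in the tree traversal.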