Tags: c#, .net, winforms, webbrowser-control, mshtml

WebBrowser HtmlElement.GetAttribute("href") prepending hostname


My Windows Forms application hosts a WebBrowser control that displays a page full of links. I'm trying to find all the anchor elements in the loaded HtmlDocument and read their href attributes so I can provide a multi-file download interface in C#. Below is a simplified version of the function where I find and process the anchor elements:

public void ListAnchors(string baseUrl, HtmlDocument doc) // doc is retrieved from webBrowser.Document
{
    HtmlElementCollection anchors = doc.GetElementsByTagName("a");
    foreach (HtmlElement el in anchors)
    {
        string href = el.GetAttribute("href");
        Debug.WriteLine("el.Parent.InnerHtml = " + el.Parent.InnerHtml);
        Debug.WriteLine("el.GetAttribute(\"href\") = " + href);
    }
}

The anchor tags are all surrounded by <PRE> tags. The hostname from which I'm loading the HTML is a local machine on the network (lts930411). The source HTML for one entry looks like this:

<PRE><A href="/A/a150923a.lts">a150923a.lts</A></PRE>

The output of the above C# code for one anchor element is this:

el.Parent.InnerHtml = <A href="/A/a150923a.lts">a150923a.lts</A>

el.GetAttribute("href") = http://lts930411/A/a150923a.lts

Why is el.GetAttribute("href") adding the scheme and hostname prefix (http://lts930411) rather than returning the literal value of the href attribute from the source HTML? Is this behavior I can count on? Is this "feature" documented somewhere? (I was prepending the base URL myself, but that gave me addresses like http://lts930411http://lts930411/A/a150923a.lts. I'd be okay with just expecting the full URL if I could find documentation promising this will always happen.)


Solution

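  • As stated in the documentation for IHTMLAnchorElement.href, relative URLs are resolved against the location of the document containing the a element. That is why GetAttribute("href") returns the full URL rather than the literal attribute text.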

    To get the href attribute values exactly as they are written in the markup, one option is to extract them from each element's OuterHtml:

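    // Requires: using System.Linq; using System.Text.RegularExpressions;
    // [^"]* stops at the closing quote, so other attributes on the same tag are not captured.
    var expression = "href=\"([^\"]*)\"";
    var list = document.GetElementsByTagName("a")
                       .Cast<HtmlElement>()
                       .Select(x => Regex.Match(x.OuterHtml, expression, RegexOptions.IgnoreCase))
                       .Where(m => m.Success)
                       .Select(m => m.Groups[1].Value)
                       .ToList();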
    

    The above code returns the untouched href attribute value of every a tag in the document that has a quoted href attribute.
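
If you would rather keep using the resolved value, the double-prefix problem from the question can be avoided by letting System.Uri combine the base URL and the href instead of concatenating strings. The following is a minimal sketch, not part of the original answer: the Resolve helper and the HrefResolution class are hypothetical names, and the page address http://lts930411/ is assumed from the output shown in the question. When the second argument is already an absolute URL, the Uri(Uri, string) constructor ignores the base, so the same call works whether GetAttribute returns a relative or a fully qualified href.

```csharp
using System;

static class HrefResolution
{
    // Hypothetical helper: resolves href against the page URL.
    // If href is already absolute, the base URL is ignored.
    public static string Resolve(string baseUrl, string href)
    {
        var baseUri = new Uri(baseUrl, UriKind.Absolute);
        return new Uri(baseUri, href).AbsoluteUri;
    }

    static void Main()
    {
        const string pageUrl = "http://lts930411/"; // assumed page address

        // Literal attribute value from the source HTML.
        Console.WriteLine(Resolve(pageUrl, "/A/a150923a.lts"));

        // Value already expanded by the WebBrowser control.
        Console.WriteLine(Resolve(pageUrl, "http://lts930411/A/a150923a.lts"));
    }
}
```

Both calls print http://lts930411/A/a150923a.lts, so the download code does not need to know which form it received.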