Tags: c#, asp.net, html-parsing, html-agility-pack

Get the inner text of <a href> tags from HTML (Html Agility Pack)


I am successfully extracting the file names from the href attributes of all anchor tags in the HTML below and adding them to a list.

HTML:

<ul class="resourcelist">
    <li><a href="/upload/Article/07.pdf" target="_blank"><img src="/assets/images/pdf.png" /> <strong>SPEC SHEET: </strong> d07</a></li>
    <li><a href="/upload/Article/73.pdf" target="_blank"><img src="/assets/images/pdf.png" /> <strong>ASSEMBLY SHEET: </strong> d73</a></li>
    <li><a href="/upload/Article/75.pdf" target="_blank"><img src="/assets/images/pdf.png" /> <strong>ASSEMBLY SHEET: </strong> d75</a></li>
    <li><a href="/upload/Article/71.pdf" target="_blank"><img src="/assets/images/pdf.png" /> <strong>INSTALLATION SHEET: </strong> d71</a></li>
</ul>

C# code to parse the html:

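    // Requires: using System.Collections.Generic; using HtmlAgilityPack;
    public List<string> LinksList = new List<string>();

    public List<string> GetLinks()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(GetProductDescription("TechnicalSpecifications"));

        // SelectNodes returns null when no node matches, so guard before iterating.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a[@href]");
        if (nodes == null)
        {
            return LinksList;
        }

        foreach (var node in nodes)
        {
            // "/upload/Article/07.pdf" splits into ["", "upload", "Article", "07.pdf"].
            var href = node.Attributes["href"].Value.Split('/')[3];
            if (!LinksList.Contains(href))
            {
                LinksList.Add(href);
            }
        }
        return LinksList;
    }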

Is there a way to get everything from the start of <strong> up to the closing </a> tag, in other words all the text that is not inside < ... >?

I have looked through many questions on SO, but none of them seems to answer this.

Output example:

SPEC SHEET: d07

Thanks in advance.


Solution

  • You're effectively just collecting the inner text of the nodes. Do this:

    var texts = doc.DocumentNode
        .SelectNodes("//a[@href]")
        .Select(n => n.InnerText)
        .Distinct()
        .ToList();
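
    One caveat with raw InnerText: for the HTML in the question it keeps the whitespace around the <img> and <strong> elements, so the value has a leading space and a double space before "d07". The following is a minimal sketch of one way to tidy that up, not the only way. The class and method names (LinkTextExtractor, GetLinkTexts) are made up for illustration; HtmlEntity.DeEntitize and SelectNodes come from Html Agility Pack, and Regex is from System.Text.RegularExpressions.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

// Hypothetical helper, named for illustration only.
public static class LinkTextExtractor
{
    // Returns the distinct, whitespace-normalized text of every <a href> element.
    public static List<string> GetLinkTexts(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
        if (nodes == null)
        {
            return new List<string>();
        }

        return nodes
            .Select(n => HtmlEntity.DeEntitize(n.InnerText))    // decode entities such as &amp;
            .Select(t => Regex.Replace(t, @"\s+", " ").Trim())  // collapse runs of whitespace
            .Where(t => t.Length > 0)                           // skip links with no visible text
            .Distinct()
            .ToList();
    }
}
```

    Called with the markup from the question, this should give entries in the requested form, for example "SPEC SHEET: d07" for the first link.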