
How to extract a specific link in C#?


I'm using the Html Agility Pack to extract some data from a website that contains the following HTML:

 <div class="pull-right">
          <ul class="list-inline">
            <li class="social">
              <a target="_blank" href="https://www.facebook.com/wsat.a?ref=ts&amp;fref=ts" class="">
                <i class="icon fa fa-facebook" aria-hidden="true"></i>
              </a>
            </li>
            <li class="social">
              <a target="_blank" href="https://twitter.com/wsat_News" class="">
                <i class="icon fa fa-twitter" aria-hidden="true"></i>
              </a>
            </li>
            <li>
                <a href="/user" class="hide">
                <i class=" icon fa fa-user" aria-hidden="true"></i>
              </a>
            </li>
            <li>
              <a onclick="ga('send', 'event', 'PDF', 'Download', '');" href="https://wsat.com/pdf/issue15170/index.html" target="_blank" class="">

                PDF
                <i class="icon fa fa-file-pdf-o" aria-hidden="true"></i>
              </a>
            </li>

I've managed to write the code below, which extracts the first link in the HTML (the Facebook one, https://www.facebook.com/wsat.a?ref=ts&fref=ts). However, the link I actually want is the PDF one, https://wsat.com/pdf/issue15170/index.html, and I haven't had any luck getting it. How do I specify which link to extract?

        var url = "https://wsat.com/";
        var httpClient = new HttpClient();
        var html = await httpClient.GetStringAsync(url);
        var htmlDocument = new HtmlDocument();
        htmlDocument.LoadHtml(html);

        // All <div> elements whose class attribute is exactly "pull-right"
        var links = htmlDocument.DocumentNode.Descendants("div").Where(node => node.GetAttributeValue("class", "").Equals("pull-right")).ToList();

        // href of the first <a> inside the first matching div (this is the Facebook link)
        var alink = links.First().Descendants("a").FirstOrDefault()?.GetAttributeValue("href", "");

        await Launcher.OpenAsync(alink);

Solution

  • Use an XPath expression as a selector:

    // SelectSingleNode returns null when nothing matches, hence the ?. guard
    var alink = htmlDocument.DocumentNode
        .SelectSingleNode("//li/a[contains(@onclick, 'PDF')]")
        ?.GetAttributeValue("href", "");
    

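    Explanation of the XPath (as requested):

    //li matches an li element at any depth in the document, /a selects an a element that is an immediate child of that li, and [contains(@onclick, 'PDF')] keeps only the a elements whose onclick attribute contains the string 'PDF'. The node returned is the a element, so its href can be read directly.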
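
To try the selector without hitting the live site, here is a minimal sketch that runs it against a trimmed copy of the markup from the question. It assumes the HtmlAgilityPack NuGet package is referenced. The class name `PdfLinkDemo`, the helper `FindPdfHref`, and the `SampleHtml` constant are hypothetical names made up for this illustration and are not part of the library.

```csharp
using System;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class PdfLinkDemo
{
    // Trimmed copy of the markup from the question (hypothetical test fixture).
    public const string SampleHtml = @"
<div class=""pull-right"">
  <ul class=""list-inline"">
    <li class=""social"">
      <a target=""_blank"" href=""https://twitter.com/wsat_News"" class=""""></a>
    </li>
    <li>
      <a onclick=""ga('send', 'event', 'PDF', 'Download', '');"" href=""https://wsat.com/pdf/issue15170/index.html"" target=""_blank"" class="""">PDF</a>
    </li>
  </ul>
</div>";

    // Hypothetical helper: returns the href of the PDF link, or null if no such link exists.
    public static string FindPdfHref(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Same XPath as in the answer: an <a> directly under an <li>,
        // whose onclick attribute contains 'PDF'.
        var anchor = doc.DocumentNode
            .SelectSingleNode("//li/a[contains(@onclick, 'PDF')]");

        // SelectSingleNode returns null when nothing matches, so guard before reading.
        return anchor?.GetAttributeValue("href", "");
    }

    public static void Main()
    {
        Console.WriteLine(FindPdfHref(SampleHtml) ?? "(no match)");
    }
}
```

Run against the sample markup, this should print the PDF URL rather than the Twitter one. Matching on onclick ties the selector to the site's analytics call. If that attribute ever changes, an expression such as //a[contains(@href, '/pdf/')] would be one alternative, on the assumption that the PDF URL keeps its /pdf/ path segment.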