I am successfully extracting the file names from the href attributes of all the links in the HTML below and adding them to a list.
HTML:
<ul class="resourcelist">
<li><a href="/upload/Article/07.pdf" target="_blank"><img src="/assets/images/pdf.png" /> <strong>SPEC SHEET: </strong> d07</a></li>
<li><a href="/upload/Article/73.pdf" target="_blank"><img src="/assets/images/pdf.png" /> <strong>ASSEMBLY SHEET: </strong> d73</a></li>
<li><a href="/upload/Article/75.pdf" target="_blank"><img src="/assets/images/pdf.png" /> <strong>ASSEMBLY SHEET: </strong> d75</a></li>
<li><a href="/upload/Article/71.pdf" target="_blank"><img src="/assets/images/pdf.png" /> <strong>INSTALLATION SHEET: </strong> d71</a></li>
</ul>
C# code to parse the HTML:
public List<string> LinksList = new List<string>();

public List<string> GetLinks()
{
    var doc = new HtmlDocument();
    doc.LoadHtml(GetProductDescription("TechnicalSpecifications"));
    HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a[@href]");
    foreach (var node in nodes)
    {
        var href = node.Attributes["href"].Value.Split('/')[3];
        if (!LinksList.Contains(href))
        {
            LinksList.Add(href);
        }
    }
    return LinksList;
}
Is there a way to get everything from the start of <strong> up to the closing </a> tag, i.e. all the text that is not inside < ... >?
I have looked through many questions on SO, but none of them seems to answer this.
Output example:
SPEC SHEET: d07
Thanks in advance.
You're effectively just collecting the inner text of the nodes. Do this:
var texts = doc.DocumentNode
    .SelectNodes("//a[@href]")
    .Select(n => n.InnerText)
    .Distinct()
    .ToList();
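One caveat with the markup in the question: InnerText keeps the whitespace that sits between the tags, so the raw value for the first link would likely come out as " SPEC SHEET:  d07" (a leading space and a double space) rather than "SPEC SHEET: d07". Below is a minimal sketch of one way to tidy that up, assuming HtmlAgilityPack is referenced. The class name LinkTextExtractor and the helpers NormalizeText and GetLinkTexts are hypothetical names made up for this illustration; they are not part of the library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

// Hypothetical helper class, not part of HtmlAgilityPack.
public static class LinkTextExtractor
{
    // Decodes HTML entities, collapses runs of whitespace into a single
    // space, and trims both ends.
    public static string NormalizeText(string raw)
    {
        var decoded = HtmlEntity.DeEntitize(raw ?? string.Empty);
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }

    // Returns the distinct, cleaned-up visible text of every <a href="..."> element.
    public static List<string> GetLinkTexts(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
        if (nodes == null)
        {
            return new List<string>();
        }

        return nodes
            .Select(n => NormalizeText(n.InnerText))
            .Where(t => t.Length > 0)
            .Distinct()
            .ToList();
    }
}
```

With the code from the question this could be called as LinkTextExtractor.GetLinkTexts(GetProductDescription("TechnicalSpecifications")). The null check is there because the foreach in the question would throw if the fragment contained no links.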