I'm trying to match multiple hrefs in an HTML page, but I can't get it working: my regex returns no matches. How can I match every such href in full and split each match into the two named groups (url and text)?
A sample of the many hrefs to match:
<a href="/string1/any string here/string2">text here</a>
My regex code:
MatchCollection m1 = Regex.Matches(result, @"<a\shref=""(?<url>(\/string1\/).*?(\/string2))"">(?<text>.*?)</a>", RegexOptions.Singleline);
This looser pattern works, but it also matches hrefs I'm not interested in, in addition to the ones I need:
MatchCollection m1 = Regex.Matches(result, @"<a\shref=""(?<url>(\/string1\/).*?)"">(?<text>.*?)</a>", RegexOptions.Singleline);
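One likely reason for the unwanted matches: with RegexOptions.Singleline, `.*?` can run past the closing quote of one href and into the next tag. The following is only a minimal sketch, assuming href values never contain a double quote and that the markup looks exactly like the sample; the input string is hypothetical, and top-level statements need C# 9 or later.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical input: the first link lacks /string2 and should be skipped.
string result =
    @"<a href=""/string1/other"">skip me</a>" +
    @"<a href=""/string1/any string here/string2"">text here</a>";

// [^""]*? cannot cross a double quote, so the url group stays inside one href value.
string pattern = @"<a\shref=""(?<url>/string1/[^""]*?/string2)"">(?<text>.*?)</a>";

MatchCollection matches = Regex.Matches(result, pattern, RegexOptions.Singleline);

foreach (Match m in matches)
    Console.WriteLine($"{m.Groups["url"].Value} | {m.Groups["text"].Value}");
// prints: /string1/any string here/string2 | text here
```

This stays fragile against real-world HTML (extra attributes, single quotes, different whitespace), so treat it as an illustration rather than a robust solution.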
As mentioned in the comments, use a real HTML parser such as HtmlAgilityPack instead of Regex:
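// Requires "using System.Linq;" and the HtmlAgilityPack NuGet package.
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(@"<a href=""/string1/any string here/string2"">text here</a>");
// Note: SelectNodes returns null (not an empty collection) when nothing matches.
var links = doc.DocumentNode
    .SelectNodes("//a[@href]")
    .Select(a => a.Attributes["href"].Value)
    .ToList();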
Or, without XPath:
var links = doc.DocumentNode
    .Descendants("a")
    .Where(a => a.Attributes["href"] != null)
    .Select(a => a.Attributes["href"].Value)
    .ToList();
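The snippets above return every href. To keep only the links from the question and get both the URL and the link text, something along these lines might work. It is a sketch, assuming the wanted hrefs start with /string1/ and end with /string2 as in the sample; the input string and the Url/Text property names are made up for illustration, and top-level statements need C# 9 or later.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

const string Prefix = "/string1/";
const string Suffix = "/string2";

var doc = new HtmlDocument();
// Hypothetical input: only the second link has both the prefix and the suffix.
doc.LoadHtml(@"<a href=""/string1/other"">skip me</a>
<a href=""/string1/any string here/string2"">text here</a>");

var links = doc.DocumentNode
    .Descendants("a")
    // GetAttributeValue falls back to "" when the anchor has no href.
    .Select(a => new { Url = a.GetAttributeValue("href", ""), Text = a.InnerText })
    // The length check keeps the prefix and suffix from overlapping.
    .Where(l => l.Url.Length >= Prefix.Length + Suffix.Length
             && l.Url.StartsWith(Prefix, StringComparison.Ordinal)
             && l.Url.EndsWith(Suffix, StringComparison.Ordinal))
    .ToList();

foreach (var l in links)
    Console.WriteLine($"{l.Url} | {l.Text}");
```

Url and Text here play the role of the url and text groups in the original regex, and the filter is ordinary string logic that is easy to adjust.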