c# regex href

Matching an href with any string between two known strings


I'm trying to match multiple hrefs in an HTML page, but I can't get it working: my regex returns no matches. How can I match each href in full and split every match into the two named groups (url and text)?

A sample of the many hrefs to match:

<a href="/string1/any string here/string2">text here</a>

My regex code:

MatchCollection m1 = Regex.Matches(result, @"<a\shref=""(?<url>(\/string1\/).*?(\/string2))"">(?<text>.*?)</a>", RegexOptions.Singleline);

This works, but it also matches hrefs I'm not interested in, in addition to the ones I need:

MatchCollection m1 = Regex.Matches(result, @"<a\shref=""(?<url>(\/string1\/).*?)"">(?<text>.*?)</a>", RegexOptions.Singleline);

Solution

  • As mentioned in the comments, use a real HTML parser such as HtmlAgilityPack instead of a regex:

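    // Requires the HtmlAgilityPack package and `using System.Linq;`
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(@"<a href=""/string1/any string here/string2"">text here</a>");
    
    // Note: SelectNodes returns null (not an empty list) when nothing matches
    var links = doc.DocumentNode
                    .SelectNodes("//a[@href]")
                    .Select(a => a.Attributes["href"].Value)
                    .ToList();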
    

    or, without XPath:

    var links = doc.DocumentNode
                    .Descendants("a")
                    .Where(a=>a.Attributes["href"]!=null)
                    .Select(a=>a.Attributes["href"].Value)
                    .ToList();
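
The snippets above collect every href, while the question only wants the ones that start with /string1/ and end with /string2, together with the link text. A minimal sketch of that filtering step follows. The class name `HrefFilter`, the method name `FindLinks`, and the `prefix`/`suffix` parameters are hypothetical names chosen for illustration, and the code assumes the HtmlAgilityPack package is referenced.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class HrefFilter
{
    // Returns (href, inner text) pairs for every <a> element whose href
    // starts with `prefix` and ends with `suffix`.
    public static List<KeyValuePair<string, string>> FindLinks(
        string html, string prefix, string suffix)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
                  .Descendants("a")
                  // GetAttributeValue falls back to "" when href is missing,
                  // so anchors without an href are filtered out below.
                  .Select(a => new KeyValuePair<string, string>(
                      a.GetAttributeValue("href", ""), a.InnerText))
                  // The length check keeps prefix and suffix from overlapping,
                  // mirroring a regex of the form prefix + anything + suffix.
                  .Where(p => p.Key.Length >= prefix.Length + suffix.Length
                           && p.Key.StartsWith(prefix, StringComparison.Ordinal)
                           && p.Key.EndsWith(suffix, StringComparison.Ordinal))
                  .ToList();
    }
}
```

Calling `HrefFilter.FindLinks(result, "/string1/", "/string2")` would then give a list in which each `Key` plays the role of the url group and each `Value` the role of the text group from the original regex.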