c# .net regex string-parsing

How to extract string between 2 markers using Regex in .NET?


I have the source of a web page and I need to extract the body, i.e. everything between </head><body> and </body></html>.

I've tried the following with no success:

var match = Regex.Match(output, @"(?<=\</head\>\<body\>)(.*?)(?=\</body\>\</html\>)");

It finds a string but cuts it off long before </body></html>. I escaped characters based on the RegEx cheat sheet.

What am I missing?


Solution

  • I'd recommend using HtmlAgilityPack instead. Parsing HTML with regular expressions is very fragile.

    The latest version even supports LINQ, so you can get your content like this:

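    using System.Linq;       // for Single()
    using HtmlAgilityPack;   // HtmlWeb, HtmlDocument
    // Download and parse the page, then take the inner HTML of its only <body> element.
    HtmlWeb web = new HtmlWeb();
    HtmlDocument doc = web.Load("http://stackoverflow.com");
    string html = doc.DocumentNode.Descendants("body").Single().InnerHtml;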
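
    If a regular expression is still wanted for a quick one-off, here is a rough sketch of what the pattern in the question is probably missing. In .NET, `.` does not match a newline unless RegexOptions.Singleline is set, and the lookbehind in the question only matches when </head> and <body> are directly adjacent, with no whitespace and no attributes on the body tag. The class name BodyExtractor and the method ExtractBody are made up for illustration. This is not a robust HTML parser and it assumes the page has a single, well-formed body element.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not from the question or the answer above.
public static class BodyExtractor
{
    // Singleline: '.' also matches '\n', so the lazy (.*?) can span multiple lines.
    // IgnoreCase: tolerates <BODY> or <Body>.
    // <body\b[^>]*> allows attributes on the opening tag, e.g. <body class="x">.
    private static readonly Regex BodyPattern = new Regex(
        @"<body\b[^>]*>(.*?)</body\s*>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Returns the text between the first <body ...> tag and the first </body> after it,
    // or null when no such pair is found.
    public static string ExtractBody(string html)
    {
        Match match = BodyPattern.Match(html);
        return match.Success ? match.Groups[1].Value : null;
    }
}
```

    With the `output` string from the question, the call would be `string body = BodyExtractor.ExtractBody(output);`. A capture group is used instead of lookarounds so the pattern does not depend on what comes immediately before <body> or after </body>.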