I have a source to a web page and I need to extract the body. So anything between </head><body>
and </body></html>
.
I've tried the following with no success:
var match = Regex.Match(output, @"(?<=\</head\>\<body\>)(.*?)(?=\</body\>\</html\>)");
It finds a string but cuts it off long before </body></html>
. I escaped characters based on the RegEx cheat sheet.
What am i missing?
I'd recommend using the HtmlAgilityPack instead - parsing HTML with regular expressions is very, very fragile.
The latest version even supports Linq so you can get your content like this:
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://stackoverflow.com");
string html = doc.DocumentNode.Descendants("body").Single().InnerHtml;